1
High Performance Cluster Computing: Architectures and Systems
Book Editor: Rajkumar Buyya
Slides Prepared by: Hai Jin, Internet and Cluster Computing Center
2
Cluster Computing at a Glance Chapter 1: by M. Baker and R. Buyya
- Introduction
- Scalable Parallel Computer Architectures
- Towards Low Cost Parallel Computing and Motivations
- Windows of Opportunity
- A Cluster Computer and its Architecture
- Clusters Classifications
- Commodity Components for Clusters
- Network Services/Communications SW
- Cluster Middleware and Single System Image
- Resource Management and Scheduling (RMS)
- Programming Environments and Tools
- Cluster Applications
- Representative Cluster Systems
- Cluster of SMPs (CLUMPS)
- Summary and Conclusions
3
Introduction Need more computing power
- Improving the operating speed of processors and other components is constrained by the speed of light, thermodynamic laws, and the high financial cost of processor fabrication
- Alternative: connect multiple processors together and coordinate their computational efforts
- Parallel computers allow the sharing of a computational task among multiple processors
4
How to Run Applications Faster?
There are 3 ways to improve performance:
- Work harder — computer analogy: use faster hardware
- Work smarter — computer analogy: use optimized algorithms and techniques to solve computational tasks
- Get help — computer analogy: use multiple computers to solve a particular task
5
Era of Computing
- Rapid technical advances
  - recent advances in VLSI technology
  - software technology: OSs, programming languages, development methodologies, and tools
  - grand challenge applications have become the main driving force
- Parallel computing
  - one of the best ways to overcome the speed bottleneck of a single processor
  - good price/performance ratio of a small cluster-based parallel computer
6
Two Eras of Computing
[Figure: the sequential and parallel eras each evolve through architectures, system software/compilers, applications, and problem-solving environments (PSEs), moving from R&D through commercialization to commodity.]
7
Scalable Parallel Computer Architectures
Taxonomy based on how processors, memory, and interconnect are laid out:
- Massively Parallel Processors (MPP)
- Symmetric Multiprocessors (SMP)
- Cache-Coherent Nonuniform Memory Access (CC-NUMA)
- Distributed Systems
- Clusters
8
Scalable Parallel Computer Architectures
- MPP
  - A large parallel processing system with a shared-nothing architecture
  - Consists of several hundred nodes with a high-speed interconnection network/switch
  - Each node consists of a main memory and one or more processors
  - Each node runs a separate copy of the OS
- SMP
  - 2-64 processors today
  - Shared-everything architecture
  - All processors share all the global resources available
  - A single copy of the OS runs on these systems
9
Scalable Parallel Computer Architectures
- CC-NUMA
  - A scalable multiprocessor system having a cache-coherent nonuniform memory access architecture
  - Every processor has a global view of all of the memory
- Distributed systems
  - Conventional networks of independent computers
  - Have multiple system images, as each node runs its own OS
  - The individual machines could be combinations of MPPs, SMPs, clusters, and individual computers
- Clusters
  - A collection of workstations or PCs interconnected by a high-speed network
  - Work as an integrated collection of resources
  - Have a single system image spanning all the nodes
10
Key Characteristics of Scalable Parallel Computers
11
Towards Low Cost Parallel Computing
- Parallel processing
  - Linking together 2 or more computers to jointly solve a computational problem
  - Since the early 1990s, an increasing trend to move away from expensive, specialized proprietary parallel supercomputers towards networks of workstations
  - Driven by the rapid improvement in the availability of commodity high-performance components for workstations and networks
- Low-cost commodity supercomputing
  - From specialized traditional supercomputing platforms to cheaper, general-purpose systems consisting of loosely coupled components built from single- or multiprocessor PCs or workstations
  - Needs standardization of many of the tools and utilities used by parallel applications, e.g., MPI and HPF
12
Motivations for using NOW over Specialized Parallel Computers
- Individual workstations are becoming increasingly powerful
- Communication bandwidth between workstations is increasing and latency is decreasing
- Workstation clusters are easier to integrate into existing networks
- Typically low user utilization of personal workstations
- Development tools for workstations are more mature
- Workstation clusters are cheap and readily available
- Clusters can be easily grown
13
Trend
- Workstations with UNIX for science and industry vs PC-based machines for administrative work and word processing
- A rapid convergence in processor performance and kernel-level functionality of UNIX workstations and PC-based machines
14
Windows of Opportunities
- Parallel processing: use multiple processors to build MPP/DSM-like systems for parallel computing
- Network RAM: use the memory associated with each workstation as an aggregate DRAM cache
- Software RAID (redundant array of inexpensive disks): use arrays of workstation disks to provide cheap, highly available, and scalable file storage, and possibly parallel I/O support to applications
- Multipath communication: use multiple networks for parallel data transfer between nodes
15
Cluster Computer and its Architecture
- A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
- A node: a single or multiprocessor system with memory, I/O facilities, and an OS
- Generally 2 or more computers (nodes) connected together, either in a single cabinet or physically separated and connected via a LAN
- Appears as a single system to users and applications
- Provides a cost-effective way to gain features and benefits
16
Cluster Computer Architecture
[Figure: cluster architecture — sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure), which spans multiple PCs/workstations, each with communications software and network interface hardware, all connected by a cluster interconnection network/switch.]
17
Prominent Components of Cluster Computers (I)
Multiple High Performance Computers
- PCs
- Workstations
- SMPs (CLUMPS)
- Distributed HPC systems, leading to metacomputing
18
Prominent Components of Cluster Computers (II)
State-of-the-art Operating Systems
- Linux (MOSIX, Beowulf, and many more)
- Microsoft NT (Illinois HPVM, Cornell Velocity)
- Sun Solaris (Berkeley NOW, C-DAC PARAM)
- IBM AIX (IBM SP2)
- HP UX (Illinois PANDA)
- Mach, a microkernel-based OS (CMU)
- Cluster operating systems (Solaris MC, SCO Unixware, MOSIX (academic project))
- OS gluing layers (Berkeley GLUnix)
19
Prominent Components of Cluster Computers (III)
High Performance Networks/Switches
- Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
- SCI (Scalable Coherent Interface; ~12 µs MPI latency)
- ATM (Asynchronous Transfer Mode)
- Myrinet (1.28 Gbps)
- QsNet (Quadrics Supercomputing World; 5 µs latency for MPI messages)
- Digital Memory Channel
- FDDI (Fiber Distributed Data Interface)
- InfiniBand
20
Prominent Components of Cluster Computers (IV)
Network interface cards: e.g., Myrinet's NIC, with user-level access support
21
Prominent Components of Cluster Computers (V)
Fast Communication Protocols and Services
- Active Messages (Berkeley)
- Fast Messages (Illinois)
- U-net (Cornell)
- XTP (Virginia)
- Virtual Interface Architecture (VIA)
22
Comparison of Cluster Interconnects

Metric | Myrinet | QsNet | Giganet | ServerNet2 | SCI | Gigabit Ethernet
Bandwidth (MBytes/s) | 140 (33 MHz) / 215 (66 MHz) | 208 | ~105 | 165 | ~80 | 30-50
MPI latency (µs) | 16.5 (33 MHz) / 11 (66 MHz) | 5 | ~20-40 | 20.2 | 6 | —
List price/port | $1.5K | $6.5K | — | — | — | ~$1.5K
Hardware availability | Now | — | — | Q2 '00 | — | —
Linux support | Now | — | — | Late '00 | — | —
Maximum #nodes | 1000's | — | — | 64K | — | —
Protocol implementation | Firmware on adapter | Firmware on adapter | Implemented in hardware | Implemented in hardware | — | —
VIA support | Soon | None | NT/Linux | Done in hardware | Software | TCP/IP, VIA
MPI support | 3rd party | Quadrics/Compaq | 3rd party | Compaq/3rd party | — | MPICH over TCP/IP

Prices as of summer (June) 2000. Myrinet GM-MPI: 215 MBytes/s with 11 µs latency, measured with a LANai9 at 132 MHz, available now.
23
Prominent Components of Cluster Computers (VI)
- Cluster middleware
  - Single System Image (SSI)
  - System Availability (SA) infrastructure
- Hardware: DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
- Operating system kernel/gluing layers: Solaris MC, Unixware, GLUnix
- Applications and subsystems
  - Applications (system management and electronic forms)
  - Runtime systems (software DSM, PFS, etc.)
  - Resource management and scheduling (RMS) software: CODINE, LSF, PBS, Libra (economy cluster scheduler), NQS, etc.
24
Prominent Components of Cluster Computers (VII)
Parallel Programming Environments and Tools
- Threads (PCs, SMPs, NOW, ...): POSIX threads, Java threads
- MPI (Linux, NT, and many supercomputers)
- PVM
- Software DSMs (Shmem)
- Compilers: C/C++/Java; parallel programming with C++ (MIT Press book)
- RAD (rapid application development) tools: GUI-based tools for parallel-programming modeling
- Debuggers
- Performance analysis tools
- Visualization tools
25
Prominent Components of Cluster Computers (VIII)
Applications
- Sequential
- Parallel/distributed (cluster-aware) applications
  - Grand Challenge applications: weather forecasting, quantum chemistry, molecular biology modeling, engineering analysis (CAD/CAM), ...
  - PDBs, web servers, data mining
26
Key Operational Benefits of Clustering
- High performance
- Expandability and scalability
- High throughput
- High availability
27
Clusters Classification (I)
Application Target
- High Performance (HP) clusters: Grand Challenge applications
- High Availability (HA) clusters: mission-critical applications
28
Clusters Classification (II)
Node Ownership
- Dedicated clusters
- Non-dedicated clusters
  - Adaptive parallel computing
  - Communal multiprocessing
29
Clusters Classification (III)
Node Hardware
- Clusters of PCs (CoPs) / Piles of PCs (PoPs)
- Clusters of Workstations (COWs)
- Clusters of SMPs (CLUMPs)
30
Clusters Classification (IV)
Node Operating System
- Linux clusters (e.g., Beowulf)
- Solaris clusters (e.g., Berkeley NOW)
- NT clusters (e.g., HPVM)
- AIX clusters (e.g., IBM SP2)
- SCO/Compaq clusters (Unixware)
- Digital VMS clusters
- HP-UX clusters
- Microsoft Wolfpack clusters
31
Clusters Classification (V)
Node Configuration
- Homogeneous clusters: all nodes have similar architectures and run the same OS
- Heterogeneous clusters: nodes have different architectures and run different OSs
32
Clusters Classification (VI)
Levels of Clustering
- Group clusters (#nodes: 2-99): nodes connected by a SAN such as Myrinet
- Departmental clusters (#nodes: 10s to 100s)
- Organizational clusters (#nodes: many 100s)
- National metacomputers (WAN/Internet-based)
- International metacomputers (Internet-based, #nodes: 1000s to many millions)
  - Metacomputing / grid computing
  - Web-based computing
  - Agent-based computing
- Java plays a major role in web- and agent-based computing
33
Commodity Components for Clusters (I)
Processors
- Intel x86 processors: Pentium Pro and Pentium Xeon; AMD x86, Cyrix x86, etc.
- Digital Alpha: the Alpha processor integrates processing, memory controller, and network interface into a single chip
- IBM PowerPC
- Sun SPARC
- SGI MIPS
- HP PA
- Berkeley Intelligent RAM (IRAM): integrates processor and DRAM onto a single chip
34
Commodity Components for Clusters (II)
Memory and Cache
- SIMMs (single in-line memory modules)
- Extended Data Out (EDO): allows the next access to begin while the previous data is still being read
- Fast page mode: allows multiple adjacent accesses to be made more efficiently
- Access to DRAM is extremely slow compared to the speed of the processor
- The very fast memory used for cache is expensive, and cache control circuitry becomes more complex as the size of the cache grows
- Within Pentium-based machines, it is not uncommon to have a 64-bit wide memory bus and a chipset that supports 2 MBytes of external cache
35
Commodity Components for Clusters (III)
Disk and I/O
- The overall improvement in disk access time has been less than 10% per year
- Amdahl's law: the speedup obtained from faster processors is limited by the slowest system component
- Parallel I/O: carry out I/O operations in parallel, supported by a parallel file system based on hardware or software RAID
36
Commodity Components for Clusters (IV)
System Bus
- ISA bus (AT bus): originally clocked at 5 MHz and 8 bits wide; later clocked at 13 MHz and 16 bits wide
- VESA bus: 32-bit bus matched to the system's clock speed
- PCI bus: 133 MBytes/s transfer rate; adopted both in Pentium-based PCs and in non-Intel platforms (e.g., Digital Alpha Server)
37
Commodity Components for Clusters (V)
Cluster Interconnects
- Nodes communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as Active Messages
- Standard Ethernet (10 Mbps): a cheap, easy way to provide file and printer sharing, but bandwidth and latency are not balanced with the nodes' computational power
- Fast Ethernet: 100 Mbps
- Gigabit Ethernet: preserves Ethernet's simplicity while delivering enough bandwidth to aggregate multiple Fast Ethernet segments
38
Commodity Components for Clusters (VI)
Cluster Interconnects
- Asynchronous Transfer Mode (ATM)
  - Switched virtual-circuit technology using cells (small fixed-size data packets)
  - Uses optical fiber (an expensive upgrade), telephone-style cables (CAT-3), and better quality cable (CAT-5)
- Scalable Coherent Interface (SCI)
  - IEEE standard aimed at providing low-latency distributed shared memory across a cluster
  - Point-to-point architecture with directory-based cache coherence
  - Reduces the delay of interprocessor communication and eliminates the need for runtime layers of software protocol-paradigm translation
  - Less than 12 µs zero-message-length latency on Sun SPARC
  - Designed to support distributed multiprocessing with high bandwidth and low latency
  - SCI cards for SPARC's SBus, and PCI-based SCI cards from Dolphin
  - Scalability constrained by the current generation of switches and relatively expensive components
39
Commodity Components for Clusters (VII)
Cluster Interconnects
- Myrinet
  - 1.28 Gbps full-duplex interconnection network
  - Uses low-latency cut-through routing switches, which offer fault tolerance by automatic remapping of the network configuration
  - Supports both Linux and NT
  - Advantages: very low latency (5 µs, one-way point-to-point), very high throughput, programmable on-board processor for greater flexibility (see the cost-model sketch below)
  - Disadvantages: expensive ($1500 per host); complicated scaling, as switches with more than 16 ports are unavailable
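The latency and bandwidth figures above combine in the standard first-order cost model, transfer time = latency + size/bandwidth, which shows why low latency dominates small messages. A minimal sketch in C, using the Myrinet-class numbers quoted above purely as illustrative assumptions:

```c
/* First-order message-cost model: time = one-way latency + size/bandwidth.
 * The 5 us latency and ~160 MB/s (1.28 Gbps) figures are the Myrinet-class
 * numbers quoted on this slide, used here as illustrative assumptions. */
#include <stdio.h>

int main(void) {
    const double latency_s  = 5e-6;   /* one-way latency in seconds */
    const double bw_bytes_s = 160e6;  /* ~1.28 Gbps in bytes/second */
    const long sizes[] = {64, 1024, 65536, 1048576};

    for (int i = 0; i < 4; i++) {
        double t   = latency_s + sizes[i] / bw_bytes_s;
        double eff = sizes[i] / t / 1e6;   /* effective MB/s achieved */
        printf("%8ld bytes: %9.1f us, effective %6.1f MB/s\n",
               sizes[i], t * 1e6, eff);
    }
    return 0;   /* small messages achieve only a fraction of peak bandwidth */
}
```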
40
Commodity Components for Clusters (VIII)
Operating Systems
- 2 fundamental services for users
  - Make the computer hardware easier to use: create a virtual machine that differs markedly from the real machine
  - Share hardware resources among users: processor multitasking
- A newer OS service: multithreading
  - Support multiple threads of control within a process itself, i.e., parallelism within a process
  - The POSIX threads interface is a standard programming environment
- Trends
  - Modularity: MS Windows, IBM OS/2
  - Microkernels: provide only essential OS services, giving a high-level abstraction of the OS and portability
41
Commodity Components for Clusters (IX)
Operating Systems
- Linux
  - UNIX-like OS
  - Runs on cheap x86 platforms, yet offers the power and flexibility of UNIX
  - Readily available on the Internet and can be downloaded without cost
  - Easy to fix bugs and improve system performance
  - Users can develop or fine-tune hardware drivers, which can easily be made available to other users
  - Features such as preemptive multitasking, demand-paged virtual memory, multiuser and multiprocessor support
42
Commodity Components for Clusters (X)
Operating Systems
- Solaris
  - UNIX-based multithreading and multiuser OS
  - Supports Intel x86 and SPARC-based platforms
  - Real-time scheduling feature, critical for multimedia applications
  - Supports two kinds of threads: light-weight processes (LWPs) and user-level threads
  - Supports both BSD and several non-BSD file systems: CacheFS, AutoClient, TmpFS (uses main memory to contain a file system), the proc file system, and volume file systems
  - Supports distributed computing and is able to store and retrieve distributed information
  - OpenWindows allows applications to be run on remote systems
43
Commodity Components for Clusters (XI)
Operating Systems
- Microsoft Windows NT (New Technology)
  - Preemptive, multitasking, multiuser, 32-bit OS
  - Object-based security model and a special file system (NTFS) that allows permissions to be set on a file and directory basis
  - Supports multiple CPUs and provides multitasking using symmetrical multiprocessing
  - Supports different CPUs and multiprocessor machines with threads
  - Network protocols and services are integrated with the base OS: several built-in networking protocols (IPX/SPX, TCP/IP, NetBEUI) and APIs (NetBIOS, DCE RPC, Windows Sockets (Winsock))
44
Windows NT 4.0 Architecture
45
Network Services/Communication SW
- Communication infrastructure supports protocols for
  - bulk-data transport
  - streaming data
  - group communications
- Communication services provide the cluster with important QoS parameters
  - latency
  - bandwidth
  - reliability
  - fault tolerance
  - jitter control
- Network services are designed as a hierarchical stack of protocols with a relatively low-level communication API, providing the means to implement a wide range of communication methodologies
  - RPC
  - DSM
  - stream-based and message-passing interfaces (e.g., MPI, PVM)
46
Single System Image
47
What is Single System Image (SSI)?
A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource. SSI makes the cluster appear like a single machine to the user, to applications, and to the network. A cluster without an SSI is not a cluster.
48
Cluster Middleware & SSI
- Supported by a middleware layer that resides between the OS and the user-level environment
- Middleware consists of essentially 2 sublayers of software infrastructure
  - SSI infrastructure: glues together the OSs on all nodes to offer unified access to system resources
  - System availability infrastructure: enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant support among all nodes of the cluster
49
Single System Image Boundaries
- Every SSI has a boundary
- SSI support can exist at different levels within a system, one able to be built on another
50
SSI Boundaries — an application's SSI boundary
[Figure: a batch system's SSI boundary. (c) In Search of Clusters]
51
SSI Levels/Layers
- Application and subsystem level
- Operating system kernel level
- Hardware level
52
SSI at Hardware Layer

Level | Examples | Boundary | Importance
memory | SCI, DASH | memory space | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O

(c) In Search of Clusters
53
SSI at Operating System Kernel (Underware) or Gluing Layer
Level | Examples | Boundary | Importance
kernel/OS layer | Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc | type of kernel objects: files, processes, etc. | modularizes SSI code within kernel
virtual memory | none supporting operating system kernel | each distributed virtual memory space | may simplify implementation of kernel objects
microkernel | Mach, PARAS, Chorus, OSF/1AD, Amoeba | each service outside the microkernel | implicit SSI for all system services

(c) In Search of Clusters
54
SSI at Application and Subsystem Layer (Middleware)
Level | Examples | Boundary | Importance
application | cluster batch system, system management | an application | what a user wants
subsystem | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on | shared portion of the file system | implicitly supports many applications and subsystems
toolkit | OSF DCE, Sun ONC+, Apollo Domain | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

(c) In Search of Clusters
55
Single System Image Benefits
- Provides a simple, straightforward view of all system resources and activities, from any node of the cluster
- Frees the end user from having to know where an application will run
- Frees the operator from having to know where a resource is located
- Lets the user work with familiar interfaces and commands, and allows administrators to manage the entire cluster as a single entity
- Reduces the risk of operator errors, with the result that end users see improved reliability and higher availability of the system
56
Single System Image Benefits (Cont’d)
- Allows centralized or decentralized system management and control, reducing the need for skilled administrators in system administration
- Presents multiple, cooperating components of an application to the administrator as a single application
- Greatly simplifies system management
- Provides location-independent message communication
- Helps track the locations of all resources, so that system operators no longer need to be concerned with their physical location
- Provides transparent process migration and load balancing across nodes
- Improves system response time and performance
57
Middleware Design Goals
- Complete transparency in resource management
  - Allow users to use the cluster easily without knowledge of the underlying system architecture
  - The user is provided with the view of a globalized file system, processes, and network
- Scalable performance
  - Clusters can easily be expanded; their performance should scale as well
  - To extract maximum performance, the SSI service must support load balancing and parallelism by distributing the workload evenly among nodes
- Enhanced availability
  - Middleware services must be highly available at all times
  - At any time, a point of failure should be recoverable without affecting a user's application: employ checkpointing and fault-tolerance technologies
  - Handle consistency of data when replicated
58
SSI Support Services
- Single entry point: telnet cluster.myinstitute.edu instead of telnet node1.cluster.myinstitute.edu
- Single file hierarchy: xFS, AFS, Solaris MC Proxy
- Single management and control point: management from a single GUI
- Single virtual networking
- Single memory space: Network RAM / DSM
- Single job management: GLUnix, Codine, LSF
- Single user interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may use Web technology
59
Availability Support Functions
- Single I/O space (SIOS): any node can access any peripheral or disk device without knowledge of its physical location
- Single process space (SPS): any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc., as if they were on a single node
- Checkpointing and process migration (see the sketch below)
  - Checkpointing saves the process state and intermediate results, in memory or on disk, to support rollback recovery when a node fails
  - Process migration supports dynamic load balancing among the cluster nodes
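A minimal sketch in C of the checkpointing idea at the application level — the state struct, field names, and checkpoint file are hypothetical stand-ins for whatever a job would need to save to support rollback recovery:

```c
/* Application-level checkpoint sketch. The struct, fields, and file name
 * ("job.ckpt") are hypothetical illustrations, not any system's API. */
#include <stdio.h>

struct app_state {
    long   iteration;       /* how far the computation has advanced */
    double partial_result;  /* intermediate result to resume from */
};

static int save_checkpoint(const char *path, const struct app_state *s) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    return (fclose(f) == 0 && n == 1) ? 0 : -1;
}

static int load_checkpoint(const char *path, struct app_state *s) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;      /* no checkpoint yet: start from scratch */
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

int main(void) {
    struct app_state st = {0, 0.0};
    if (load_checkpoint("job.ckpt", &st) == 0)      /* rollback recovery */
        printf("resuming at iteration %ld\n", st.iteration);
    for (; st.iteration < 1000000; st.iteration++) {
        st.partial_result += 1.0 / (st.iteration + 1);
        if (st.iteration % 100000 == 0)             /* periodic checkpoint */
            save_checkpoint("job.ckpt", &st);
    }
    printf("result %f\n", st.partial_result);
    return 0;
}
```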
60
Resource Management and Scheduling
61
Resource Management and Scheduling (RMS)
- RMS is the act of distributing applications among computers to maximize their throughput
- Enables effective and efficient utilization of the resources available
- Software components
  - Resource manager: locating and allocating computational resources, authentication, process creation and migration
  - Resource scheduler: queuing applications, resource location and assignment; it instructs the resource manager what to do and when (policy)
- Reasons for using RMS
  - Provide increased, reliable throughput of user applications on the systems
  - Load balancing
  - Utilizing spare CPU cycles
  - Providing fault-tolerant systems
  - Managing access to powerful systems, etc.
- Basic architecture of RMS: a client-server system
62
RMS Components
63
Libra: An example cluster scheduler
[Figure: Libra scheduler architecture — a user submits a job through the cluster management system (PBS) on the server (master node), where the Libra application scheduler performs job input control, budget checking, and deadline control; a node-querying module and best-node evaluator drive job dispatch control onto the cluster worker nodes (node 1, node 2, ..., node N), each running the pbs_mom node monitor.]
64
Services provided by RMS
- Process migration
  - When a computational resource has become too heavily loaded
  - For fault-tolerance reasons
- Checkpointing
- Scavenging idle cycles: 70% to 90% of the time, most workstations are idle
- Fault tolerance
- Minimization of impact on users
- Load balancing
- Multiple application queues
65
Some Popular Resource Management Systems
Commercial systems: LSF, SGE, Easy-LL, NQE
Public domain systems: Condor, GNQS, DQS, PBS, Libra
66
Cluster Programming
67
Cluster Programming Environments
- Shared memory based: DSM; Threads/OpenMP (enabled for clusters); Java threads (HKU JESSICA, IBM cJVM)
- Message passing based: PVM, MPI
- Parametric computations: Nimrod-G
- Automatic parallelising compilers
- Parallel libraries and computational kernels (e.g., NetSolve)
68
Levels of Parallelism

Code granularity | Code item | Parallelised by
Large grain (task level) | Program (task i-1, task i, task i+1) | PVM/MPI
Medium grain (control level) | Function/thread (func1, func2, func3) | Threads
Fine grain (data level) | Loop (a(0)=.., b(0)=.., ...) | Compilers
Very fine grain (multiple issue) | Instruction (load, +, x) | CPU hardware
69
Programming Environments and Tools (I)
Threads (PCs, SMPs, NOW, ...)
- In multiprocessor systems: used to simultaneously utilize all the available processors (see the sketch below)
- In uniprocessor systems: used to utilize the system resources effectively
- Multithreaded applications offer quicker response to user input and run faster
- Potentially portable, as there exists an IEEE standard for the POSIX threads interface (pthreads)
- Extensively used in developing both application and system software
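A minimal POSIX threads sketch in C of this pattern: several threads each sum a disjoint slice of an array, so a multiprocessor node can keep all of its processors busy. Compile with something like `cc demo.c -lpthread` (exact flags vary by platform):

```c
/* Data-parallel summation with POSIX threads: each thread handles
 * one disjoint slice of the array. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

struct slice { int begin, end; double sum; };

static void *partial_sum(void *arg) {
    struct slice *s = arg;
    s->sum = 0.0;
    for (int i = s->begin; i < s->end; i++)
        s->sum += data[i];
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct slice work[NTHREADS];

    for (int i = 0; i < N; i++) data[i] = 1.0;

    for (int t = 0; t < NTHREADS; t++) {
        work[t].begin = t * (N / NTHREADS);
        work[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, partial_sum, &work[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait, then combine partial sums */
        total += work[t].sum;
    }
    printf("total = %f\n", total);    /* expect 1000000.0 */
    return 0;
}
```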
70
Programming Environments and Tools (II)
Message Passing Systems (MPI and PVM)
- Allow efficient parallel programs to be written for distributed memory systems
- The 2 most popular high-level message-passing systems: PVM and MPI
- PVM: both an environment and a message-passing library
- MPI: a message-passing specification, designed to be a standard for distributed memory parallel computing using explicit message passing (see the sketch below)
  - Attempts to establish a practical, portable, efficient, and flexible standard for message passing
- Generally, application developers prefer MPI, as it is fast becoming the de facto standard for message passing
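A minimal MPI sketch in C showing the explicit message passing the specification standardizes; the compile/run commands (typically `mpicc` and `mpirun -np 2`) vary by MPI implementation:

```c
/* Minimal MPI program: rank 0 sends an integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* explicit send: buffer, count, type, destination, tag, comm */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```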
71
Programming Environments and Tools (III)
Distributed Shared Memory (DSM) Systems
- Message passing: the most efficient, widely used programming paradigm on distributed memory systems, but complex and difficult to program
- Shared memory systems: offer a simple and general programming model, but suffer from scalability limits
- DSM on distributed memory systems: an alternative, cost-effective solution
  - Software DSM
    - Usually built as a separate layer on top of the communications interface
    - Takes full advantage of application characteristics: virtual pages, objects, and language types are the units of sharing
    - Examples: TreadMarks, Linda
  - Hardware DSM
    - Better performance, no burden on users and software layers, fine granularity of sharing, extensions of the cache coherence scheme, but increased hardware complexity
    - Examples: DASH, Merlin
72
Programming Environments and Tools (IV)
Parallel Debuggers and Profilers
- Debuggers
  - Very limited
  - HPDF (High Performance Debugging Forum), started as a Parallel Tools Consortium project in 1996, developed the HPD version specification, which defines the functionality, semantics, and syntax for a command-line parallel debugger
- TotalView
  - A commercial product from Dolphin Interconnect Solutions
  - The only widely available GUI-based parallel debugger that supports multiple HPC platforms
  - Usable only in homogeneous environments, where each process of the parallel application being debugged must run under the same version of the OS
73
Functionality of Parallel Debugger
- Managing multiple processes and multiple threads within a process
- Displaying each process in its own window
- Displaying source code, stack trace, and stack frame for one or more processes
- Diving into objects, subroutines, and functions
- Setting both source-level and machine-level breakpoints
- Sharing breakpoints between groups of processes
- Defining watch and evaluation points
- Displaying arrays and their slices
- Manipulating code variables and constants
74
Programming Environments and Tools (V)
Performance Analysis Tools
- Help a programmer understand the performance characteristics of an application
- Analyze and locate parts of an application that exhibit poor performance and create program bottlenecks
- Major components
  - A means of inserting instrumentation calls to the performance monitoring routines into the user's application
  - A run-time performance library that consists of a set of monitoring routines
  - A set of tools for processing and displaying the performance data
- Issue with performance monitoring tools: the intrusiveness of the tracing calls and their impact on application performance; instrumentation affects the performance characteristics of the parallel application and thus provides a false view of its performance behavior
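A toy illustration in C of the instrumentation idea: `monitor_enter`/`monitor_exit` are hypothetical names standing in for a real monitoring library's routines. Even this tiny version exhibits the intrusiveness issue noted above, since the measurement code itself consumes time:

```c
/* Sketch of instrumentation calls bracketing a code region; a real tool
 * would write events to a trace file for later analysis/visualization. */
#include <stdio.h>
#include <sys/time.h>

static double region_start;   /* single region only, for brevity */

static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void monitor_enter(const char *region) {
    (void)region;
    region_start = now_seconds();
}

static void monitor_exit(const char *region) {
    /* This printf itself perturbs timing -- the intrusiveness problem. */
    printf("%s: %.6f s\n", region, now_seconds() - region_start);
}

int main(void) {
    monitor_enter("compute");
    double s = 0.0;
    for (long i = 1; i < 10000000; i++) s += 1.0 / i;
    monitor_exit("compute");
    printf("s = %f\n", s);
    return 0;
}
```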
75
Performance Analysis and Visualization Tools
Tool | Supports
AIMS | instrumentation, monitoring library, analysis
MPE | logging library and snapshot performance visualization
Pablo | monitoring library and analysis
Paradyn | dynamic instrumentation, run-time analysis
SvPablo | integrated instrumentor, monitoring library, and analysis
Vampir | monitoring library, performance visualization
Dimemas | performance prediction for message-passing programs
Paraver | program visualization and analysis
76
Programming Environments and Tools (VI)
Cluster Administration Tools
- Berkeley NOW
  - Gathers and stores data in a relational DB
  - Uses a Java applet to allow users to monitor the system
- SMILE (Scalable Multicomputer Implementation using Low-cost Equipment)
  - Its tool is called K-CAP
  - Consists of compute nodes, a management node, and a client that can control and monitor the cluster
  - K-CAP uses a Java applet to connect to the management node through a predefined URL address in the cluster
- PARMON
  - A comprehensive environment for monitoring large clusters
  - Uses client-server techniques (parmon-server and parmon-client) to provide transparent access to all nodes to be monitored
77
Need for More Computing Power: Grand Challenge Applications
Solving technology problems using computer modeling, simulation, and analysis:
- Geographic information systems
- Aerospace
- Life sciences
- CAD/CAM
- Digital biology
- Military applications
78
Case Studies of Some Cluster Systems
79
Representative Cluster Systems (I)
The Berkeley Network of Workstations (NOW) Project
- Demonstrates building a large-scale parallel computer system using mass-produced commercial workstations and the latest commodity switch-based network components
- Interprocess communication
  - Active Messages (AM): the basic communication primitive in Berkeley NOW
  - A simplified remote procedure call that can be implemented efficiently on a wide range of hardware
- Global Layer Unix (GLUnix)
  - An OS layer designed to provide transparent remote execution, support for interactive parallel and sequential jobs, load balancing, and backward compatibility for existing application binaries
  - Aims to provide a cluster-wide namespace, using Network PIDs (NPIDs) and Virtual Node Numbers (VNNs)
80
Architecture of NOW System
81
Representative Cluster Systems (II)
The Berkeley Network of Workstations (NOW) Project
- Network RAM
  - Allows utilizing free resources on idle machines as a paging device for busy machines
  - Serverless: any machine can be a server when it is idle, or a client when it needs more memory than is physically available
- xFS: Serverless Network File System
  - A serverless, distributed file system that attempts to provide low-latency, high-bandwidth access to file system data by distributing the functionality of the server among the clients
  - The function of locating data in xFS is distributed by having each client responsible for servicing requests on a subset of the files
  - File data is striped across multiple clients to provide high bandwidth
82
Representative Cluster Systems (III)
The High Performance Virtual Machine (HPVM) Project
- Delivers supercomputer performance on a low-cost COTS system
- Hides the complexities of a distributed system behind a clean interface
- Challenges addressed by HPVM
  - Delivering high-performance communication to standard, high-level APIs
  - Coordinating scheduling and resource management
  - Managing heterogeneity
83
HPVM Layered Architecture
84
Representative Cluster Systems (IV)
The High Performance Virtual Machine (HPVM) Project
- Fast Messages (FM)
  - A high-bandwidth, low-latency communication protocol based on Berkeley Active Messages
  - Contains functions for sending long and short messages and for extracting messages from the network
  - Guarantees and controls the memory hierarchy
  - Guarantees reliable, ordered packet delivery, as well as control over the scheduling of communication work
  - Originally developed on a Cray T3D and a cluster of SPARCstations connected by Myrinet hardware
  - A low-level software interface that delivers hardware communication performance; high-level layers offer greater functionality, application portability, and ease of use
85
Representative Cluster Systems (V)
The Beowulf Project
- Investigates the potential of PC clusters for performing computational tasks
- Refers to a Pile-of-PCs (PoPC) to describe a loose ensemble or cluster of PCs
- Emphasizes the use of mass-market commodity components, dedicated processors, and a private communication network
- Aims to achieve the best overall system cost/performance ratio for the cluster
86
Representative Cluster Systems (VI)
The Beowulf Project
- System software
  - Grendel: the collection of software tools for resource management and support of distributed applications
  - Communication
    - Through TCP/IP over Ethernet internal to the cluster
    - Employs multiple Ethernet networks in parallel to satisfy the internal data transfer bandwidth required, achieved by 'channel bonding' techniques
  - Extends the Linux kernel to allow a loose ensemble of nodes to participate in a number of global namespaces
  - Two Global Process ID (GPID) schemes
    - One independent of external libraries
    - GPID-PVM, compatible with the PVM Task ID format, using PVM as its signal transport
87
Representative Cluster Systems (VII)
Solaris MC: A High Performance Operating System for Clusters
- A distributed OS for a multicomputer: a cluster of computing nodes connected by a high-speed interconnect
- Provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network
- Built as a globalization layer on top of the existing Solaris kernel
- Interesting features
  - Extends the existing Solaris OS
  - Preserves existing Solaris ABI/API compliance
  - Provides support for high availability
  - Uses C++, IDL, and CORBA in the kernel
  - Leverages Spring technology
88
Solaris MC Architecture
89
Representative Cluster Systems (VIII)
Solaris MC: A High Performance Operating System for Clusters
- Uses an object-oriented framework for communication between nodes
  - Based on CORBA
  - Provides remote object method invocation
  - Provides object reference counting
  - Supports multiple object handlers
- Single system image features
  - Global file system: a distributed file system, called ProXy File System (PXFS), provides a globalized file system without modifying the existing file system
  - Globalized process management
  - Globalized networking and I/O
90
Cluster System Comparison Matrix
Project | Platform | Communications | OS | Other
Beowulf | PCs | Multiple Ethernet with TCP/IP | Linux and Grendel | MPI/PVM, Sockets, and HPF
Berkeley NOW | Solaris-based PCs and workstations | Myrinet and Active Messages | Solaris + GLUnix + xFS | AM, PVM, MPI, HPF, Split-C
HPVM | PCs | Myrinet with Fast Messages | NT or Linux connection and global resource manager + LSF | Java-fronted, FM, Sockets, Global Arrays, SHMEM, and MPI
Solaris MC | Solaris-based PCs and workstations | Solaris-supported | Solaris + Globalization layer | C++ and CORBA
91
Cluster of SMPs (CLUMPS)
Clusters of multiprocessors (CLUMPS)
- Poised to be the supercomputers of the future
- Multiple SMPs with several network interfaces can be connected using high-performance networks
- 2 advantages
  - Benefit from the high-performance, easy-to-use-and-program SMP systems with a small number of CPUs
  - Clusters can be set up with moderate effort, resulting in easier administration and better support for data locality inside a node
92
Hardware and Software Trends
- Tenfold network performance increase using 100BaseT Ethernet with full duplex support
- The availability of switched network circuits, including full crossbar switches for proprietary network technologies such as Myrinet
- Workstation performance has improved significantly
- Improvement in microprocessor performance has led to the availability of desktop PCs with the performance of low-end workstations, at significantly lower cost
- The performance gap between supercomputers and commodity-based clusters is closing rapidly
- Parallel supercomputers are now equipped with COTS components, especially microprocessors
- Increasing usage of SMP nodes with two to four processors
- The average number of transistors on a chip is growing by about 40% per annum
- The clock frequency growth rate is about 30% per annum
93
Technology Trend
94
Advantages of using COTS-based Cluster Systems
- Price/performance when compared with a dedicated parallel supercomputer
- Incremental growth that often matches yearly funding patterns
- The provision of a multipurpose system
95
Computing Platforms Evolution: Breaking Administrative Barriers
[Figure: performance grows as administrative barriers are crossed — individual, group, department, campus, state, national, globe, inter-planet, universe — moving from the desktop (single processor), through SMPs or supercomputers and local clusters, to enterprise, global, and inter-planet clusters/grids.]
96
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
97
Cluster Setup and its Administration
- Introduction
- Setting up the Cluster
- Security
- System Monitoring
- System Tuning
98
Introduction (1)
- Affordable and reasonably efficient clusters seem to flourish everywhere
  - High-speed networks and processors are becoming commodity hardware
  - More traditional clustered systems are steadily getting cheaper
- Cluster systems are no longer overly specific, restricted-access systems
- New possibilities for researchers, and new questions for system administrators
99
Introduction (2)
- The Beowulf project is the most significant event in cluster computing: cheap networks, cheap nodes, Linux
- A cluster system is not just a pile of PCs or workstations
  - Getting some useful work done can be quite a slow and tedious task
  - A group of RS/6000s is not an SP2; several UltraSPARCs also can't make an AP-3000
100
Introduction (3)
- There is a lot to do before a pile of PCs becomes a single, workable system
- Managing a cluster means facing requirements completely different from more conventional systems
- A lot of hard work and custom solutions
101
Setting up the Cluster: Setup of Beowulf-class Clusters
Before designing the interconnection network or the computing nodes, we must define the cluster's purpose in as much detail as possible.
102
Starting from Scratch (1)
Interconnection Network
- Network technology: Fast Ethernet, Myrinet, SCI, ATM
- Network topology
  - Fast Ethernet (hub, switch): some algorithms show very little performance degradation when changing from full port switching to segment switching, and it is cheap
  - Direct point-to-point connection with crossed cabling
    - Hypercubes of 16 or 32 nodes, limited by the number of interfaces in each node, the complexity of cabling, and the routing (software side)
    - Dynamic routing protocols: more traffic and complexity
- OS support for bonding several physical interfaces into a single virtual one for higher throughput
103
Starting from Scratch (2)
Front-end Setup
- NFS
  - Most clusters have one or several NFS server nodes
  - NFS is neither scalable nor fast, but it works; users will want an easy way for their non-I/O-intensive jobs to work on the whole cluster with the same name space
- Front-end
  - A distinguished node where human users log in from the rest of the network
  - Where they submit jobs to the rest of the cluster
104
Starting from Scratch (3)
Advantages of using a Front-end
- Users log in, compile and debug, and submit jobs
- Keeps the environment as similar to the nodes as possible
- Advanced IP routing capabilities: security improvements, load balancing
- Provides ways to improve security, and makes administration much easier: a single system for
  - Management: install/remove software, check logs for problems, start/shutdown
  - Global operations: running the same command or distributing commands on all or selected nodes
105
Two Cluster Configuration Systems
106
Starting from Scratch (4)
Node Setup
- How to install all of the nodes at a time?
  - Network boot and automated remote installation
  - Provided that all nodes have the same configuration, the fastest way is usually to install a single node and then clone it
- How can one have access to the consoles of all nodes?
  - Keyboard/monitor selector: not a real solution, and does not scale even for a middle-sized cluster
  - Software consoles
107
Directory Services inside the Cluster
- A cluster is supposed to keep a consistent image across all its nodes: same software, same configuration
- Need a single, unified way to distribute the same configuration across the cluster
108
NIS vs. NIS+
- NIS
  - Sun Microsystems' client-server protocol for distributing system configuration data, such as user and host names, between computers on a network
  - Keeps a common user database
  - Has no way of dynamically updating network routing information or propagating configuration changes to user-defined applications
- NIS+
  - A substantial improvement over NIS, but not so widely available, a mess to administer, and still leaves much to be desired
109
LDAP vs. User Authentication
- LDAP
  - Defined by the IETF to encourage adoption of X.500 directories; the Directory Access Protocol (DAP) was seen as too complex for simple Internet clients to use
  - Defines a relatively simple protocol for updating and searching directories, running over TCP/IP
- User authentication
  - The foolproof solution: copy the password file to each node
  - As for other configuration tables, there are different solutions
110
DCE Integration
- Provides a highly scalable directory service, a security service, a distributed file system, clock synchronization, threads, and RPC
- An open standard, but not available on certain platforms
- Some of its services have already been surpassed by further developments
  - DCE threads are based on an early POSIX draft, and there have been significant changes since then
  - DCE servers tend to be rather expensive and complex
- DCE RPC has some important advantages over Sun ONC RPC
- DFS is more secure and easier to replicate and cache effectively than NFS
  - Can be more useful on a large campus-wide network
  - Supports replicated servers for read-only data
111
Global Clock Synchronization
- Serialization needs global time; failing to provide it tends to produce subtle and difficult-to-track errors
- In order to implement a global time service
  - DCE DTS (Distributed Time Service): better than NTP
  - NTP (Network Time Protocol): widely employed on thousands of hosts across the Internet, with support for a variety of time sources (see the formulas below)
- Needs for strict UTC synchronization: time servers, GPS
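For reference, the standard NTP estimate of a client's clock offset and of the round-trip delay, computed from the four timestamps of one request/response exchange:

```latex
% Standard NTP offset/delay estimation from one request/response:
%   T_1: client send, T_2: server receive, T_3: server send, T_4: client receive
\theta = \frac{(T_2 - T_1) + (T_3 - T_4)}{2}, \qquad
\delta = (T_4 - T_1) - (T_3 - T_2)
% \theta estimates the clock offset; \delta is the round-trip network delay.
```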
112
Heterogeneous Clusters
- Reasons for heterogeneous clusters
  - Exploiting the higher floating-point performance of certain architectures and the low cost of other systems; research purposes; NOWs making use of idle hardware
- Heterogeneity means automating administration work becomes more complex
  - File system layouts are converging, but are still far from coherent
  - Software packaging is different; POSIX attempts at standardization have had little success
  - Administration commands are also different
  - Endianness differences, word-length differences
- Solution: develop a per-architecture and per-OS set of wrappers with a common external view
113
Some Experiences with PoPC Clusters
- Borg: a 24-node Linux cluster at the LFCIA laboratory
  - AMD K6 processors, 2 Fast Ethernet networks
  - The front-end is a dual PII with an additional network interface, acting as a gateway to external workstations and monitoring the nodes with mon
  - 24-port 3Com SuperStack II 3300: managed by serial console, telnet, HTML client, and RMON
  - Switches are a suitable point for monitoring; most of the management is done by the switch itself
- While simple and inexpensive, this solution gives good manageability, keeping the response time low and providing more than enough information when needed
114
borg, the Linux Cluster at LFCIA
115
Monitoring the borg
116
Security Policies
- End users have to play an active role in keeping a secure environment; they must understand
  - The real need for security
  - The reasons behind the security measures taken
  - The way to use them properly
- Tradeoff between usability and security
117
Finding the Weakest Point in NOWs and COWs
- Isolating services from each other is almost impossible
  - While we all realize how potentially dangerous some services are, it is sometimes difficult to track how these relate to other, seemingly innocent ones
  - Allowing rsh access from the outside is bad: a single intrusion implies a security compromise for all nodes
- A service is not safe unless all of the services it depends on are at least equally safe
118
Weak Point due to the Intersection of Services
119
A Little Help from a Front-end
- Human factor: destroying consistency
- Information leaks: TCP/IP
- Clusters are often used from external workstations in other networks
- A front-end is justified from a security viewpoint in most cases: it can serve as a simple firewall
120
Security versus Performance Tradeoffs
- Most security measures have little impact on performance, and proper planning can avoid that impact
- Tradeoffs
  - More usability versus more security
  - Better performance versus more security: the case with strong ciphers
121
Clusters of Clusters
- Building clusters of clusters is common practice for large-scale testing, but special care must be taken with the security implications
- Building secure tunnels between the clusters, usually from front-end to front-end
  - Unsafe network, high security requirements: a dedicated tunnel front-end, or keeping the usual front-end free for just the tunneling
- Nearby clusters on the same backbone: letting the switches do the work
  - VLANs using a trusted backbone switch
122
Intercluster Communication using a Secure Tunnel
123
VLAN using a Trusted Backbone Switch
124
System Monitoring
- It is vital to stay informed of any incidents that may cause unplanned downtime or intermittent problems
- Some problems that are trivially found on a single system may stay hidden for a long time before they are detected on a cluster
125
Unsuitability of General Purpose Monitoring Tools
- The main purpose of general tools is network monitoring; this is not the case with clusters, where the network is just a system component, even if a critical one, rather than the sole subject of monitoring
- In most cluster setups it is possible to install custom agents in the nodes to track usage, load, and network traffic, tune the OS, find I/O bottlenecks, foresee possible problems, or balance future system purchases
126
Subjects of Monitoring (1)
Physical Environment
- Candidate monitoring subjects: temperature, humidity, supply voltage, the functional status of moving parts (fans)
- Keeping environmental variables stable within reasonable values greatly helps keep the MTBF high
127
Subjects of Monitoring (2)
Logical Services
- Monitoring of logical services is aimed at finding current problems when they are already impacting the system
  - A low delay until the problem is detected and isolated must be a priority
  - Find errors or misconfigurations
- Logical services range from low level (raw network access, running processes) to high level (RPC and NFS services running, correct routing)
- All monitoring tools provide some way of defining customized scripts for testing individual services (see the probe sketch below)
  - Connecting to the telnet port of a server and receiving the "login" prompt is not enough to ensure that users can log in; bad NFS mounts could cause their login scripts to sleep forever
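A minimal service-probe sketch in C along those lines: it only checks that a TCP port accepts connections (the host address and port are placeholders), which, as noted above, is necessary but not sufficient evidence that the service really works:

```c
/* TCP connect probe: reports a port as up/down. A real check would go
 * further (e.g., complete a login), since an open port alone does not
 * prove the service is healthy. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int probe_tcp(const char *ip, int port) {
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    int ok = connect(fd, (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return ok;              /* 0 = reachable, -1 = down or unreachable */
}

int main(void) {
    const char *node = "192.168.1.10";   /* placeholder node address */
    printf("telnet port on %s is %s\n", node,
           probe_tcp(node, 23) == 0 ? "up" : "down");
    return 0;
}
```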
128
Subjects of Monitoring (3)
Performance Meters
- Performance meters tend to be completely application specific
  - Code profiling: side effects on time and cache
  - Spy nodes: for network load balancing
- Special care must be taken when tracing events that span several nodes: it is very difficult to guarantee good enough cluster-wide synchronization
129
Self Diagnosis and Automatic Corrective Procedures
- Taking corrective measures
  - Making the system take these decisions itself
  - Taking automatic preventive measures
  - Most actions end up being "page the administrator"
- To take reasonable decisions, the system should know which sets of symptoms lead to suspecting which failures, and the appropriate corrective procedures to take
- For any nontrivial service the graph of dependencies will be quite complex, and this kind of reasoning almost asks for an expert system
- Any monitor performing automatic corrections should at least be based on a rule-based system, and not rely on direct alert-action relations
130
System Tuning: Developing Custom Models for Bottleneck Detection
- No tuning can be done without defined goals
- Tuning a system can be seen as minimizing a cost function: higher throughput for one job may not help if it increases network load
- No performance gain comes for free; it often means a tradeoff among performance, safety, generality, and interoperability
131
Focusing on Throughput or Focusing on Latency
- Most UNIX systems are tuned for high throughput, adequate for general timesharing systems
- Clusters are frequently used as a large single-user system, where the main bottleneck is latency
  - Network latency tends to be especially critical for most applications, but is hardware dependent
  - Lightweight protocols do help somewhat, but with the current highly optimized IP stacks there is no longer a huge difference on most hardware
- Each node can be considered as just a component of the whole cluster, and its tuning aimed at global performance
132
I/O Implications
- I/O subsystems as used in conventional servers are not always a good choice for cluster nodes
  - Commodity off-the-shelf IDE disk drives are cheaper and faster, and even have the advantage of lower latency, than most higher-end SCSI subsystems
  - While they obviously don't behave as well under high load, that is not always a problem, and the money saved may mean additional nodes
- As there is usually a common shared space served from a server, a robust, faster, and probably more expensive disk subsystem will be better suited there for the large number of concurrent accesses
- The difference between raw disk and filesystem throughput becomes more evident as systems are scaled up
- Software RAID: distributing data across nodes
133
Behavior of Two Systems in a Disk Intensive Setting
134
Caching Strategies
- There is one important difference between conventional multiprocessors and clusters: the availability of shared memory
  - The only factor that cannot be hidden is the completely different memory hierarchy
- Usual data caching strategies may often have to be inverted
  - The local disk is just a slower, persistent device for long-term storage
  - Faster rates can be obtained from concurrent access to other nodes, at the cost of using those nodes' resources; a saturated cluster with overloaded nodes may perform worse
  - Getting a data block from the network can provide both lower latency and higher throughput than getting it from the local disk
135
Shared versus Distributed Memory
136
Typical Latency and Throughput for a Memory Hierarchy
137
Fine-tuning the OS
- Getting big improvements just by tuning the system is unrealistic most of the time
- Virtual memory subsystem tuning
  - Optimizations depend on the application, but large jobs often benefit from some VM tuning
  - Highly tuned code will fit the available memory and keep the system from paging until a very high watermark has been reached
  - Tuning the VM subsystem has been traditional for large systems, as traditional Fortran code tends to overcommit memory in a huge way
- Networking: when the application is communication-limited
  - For bulk data transfers: increasing the TCP and UDP receive buffers, large windows, and window scaling (see the sketch below)
  - Inside clusters: limiting the retransmission timeouts; switches tend to have large buffers and can generate important delays under heavy congestion
  - Direct user-level protocols
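A small sketch in C of the per-socket receive-buffer tuning mentioned above; the 256 KB request is an arbitrary example, and the kernel may clamp it to a system-wide limit, which is why the code reads the granted value back:

```c
/* Enlarge a socket's receive buffer for bulk transfers and verify what
 * the kernel actually granted. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int want = 256 * 1024;              /* illustrative request, bytes */
    if (fd < 0) { perror("socket"); return 1; }

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof want) < 0)
        perror("setsockopt");

    int got = 0;
    socklen_t len = sizeof got;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("asked for %d bytes, kernel granted %d\n", want, got);

    close(fd);
    return 0;
}
```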
138
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
139
Constructing Scalable Services
- Introduction
- Environment
- Resource Sharing
- Resource Sharing Enhanced Locality
- Prototype Implementation and Extension
- Conclusions and Future Study
140
Introduction
- A complex network system may be viewed as a collection of services
- Resource sharing
  - Goal: achieving maximal system performance by utilizing the available system resources efficiently
  - Propose a scalable and adaptive resource sharing service
  - Coordinate concurrent access to system resources
  - Cooperation and negotiation to better support resource sharing
- Many algorithms for distributed systems should be scalable
  - The size of a distributed system may flexibly grow as time passes
  - The performance should also be scalable
141
Environment: Complex Network Systems
- Consist of a collection of WANs and LANs
- Various nodes (static or dynamic)
- Communication channels vary greatly in their static attributes
142
Faults, Delays, and Mobility
- Faults, delays, and mobility yield frequent changes in the environment of a nomadic host
- Hence the need for network adaptation
143
Scalability Definition and Measurement
- Algorithms and techniques that work at small scale degenerate in non-obvious ways at large scale
- Many commonly used mechanisms lead to intolerable overheads or congestion when used in systems beyond a certain size
- Topology-dependent schemes, or algorithms that are system-size dependent, are not scalable
- Scalability: a system's ability to increase speedup as the number of processors increases
  - Speedup measures the possible benefit of parallel performance over sequential performance
  - Efficiency is defined as the speedup divided by the number of processors
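In symbols, with T(1) the sequential execution time and T(N) the parallel execution time on N processors, the speedup and efficiency just defined are:

```latex
% Speedup and efficiency for N processors, as defined above:
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}
% e.g., if 8 processors run a job 6 times faster: E = 6/8 = 0.75.
```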
144
Design Principles of OS for Large Scale Multicomputers
- When designing a distributed system, we want its performance to grow linearly with the system size
- The demand on any resource should be bounded by a constant that is independent of the system size
- Distributed systems often contain centralized elements (like file servers); these should be avoided
- Decentralization also assures that there is no single point of failure
145
Isoefficiency and Isospeed (1)
- Isoefficiency: the function that determines the extent to which the size of the problem can grow, as the number of processors is increased, to keep the performance constant
- Disadvantage: its use of efficiency measurements and speedup, which indicate parallel processing improvement over sequential processing rather than providing a means for comparing the behavior of different parallel systems
146
Isoefficiency and Isospeed (2)
- Scalability: an inherent property of algorithms, architectures, and their combination
- An algorithm-machine combination is scalable if the achieved average speed of the algorithm on the given machine can remain constant as the number of processors increases, provided the problem size can be increased with the system size
- Isospeed: with W the amount of work on N processors, and W' the amount of work on N' processors needed for the same average speed of the same algorithm: W' = (N' · W) / N (a worked instance follows below)
- The ratio between the amount of work and the number of processors stays constant
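A worked instance of the isospeed relation with illustrative numbers:

```latex
% Isospeed: to keep the average speed constant when growing from
% N = 4 to N' = 8 processors, a workload of W = 100 units must grow to
W' = \frac{N' \cdot W}{N} = \frac{8 \cdot 100}{4} = 200 \ \text{units}
% so that W/N = W'/N' = 25 units of work per processor in both cases.
```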
147
Scalability Measurement
- RT: response time of the system for a problem of size W
- W: the amount of execution code to be performed, measured in number of instructions
- RT': system response time for the problem of an increased size W', solved on the N'-sized system (N' > N)
- Scalability: S = RT / RT' if RT / RT' < 1, and S = 1 otherwise
148
Weak Consistency
- The environment is complex to handle
  - High degree of multiplicity (scale)
  - Variable fault rates (reliability)
  - Resources with reduced capacity (mobility)
  - Variable interconnections resulting in different sorts of latencies
- Weak consistency
  - Allows inaccuracy as well as partiality
  - State info regarding other workstations in the system is held locally in a cache
  - Cached data can be used as a hint for decision making, enabling local decisions to be made
  - Such state info is less expensive to maintain
  - Use of partial system views reduces message traffic; fewer nodes are involved in any negotiation
- Adaptive resource sharing must continue to be effective and stable as the system grows
149
Assumptions Summary
- Full logical interconnection
- Connection maintenance is transparent to the application
- Nodes have unique identifiers, numbered sequentially
- Non-negligible delays for any message exchange
150
Model Definition and Requirements
Purpose of resource sharing Achieve efficient allocation of resources to running applications Map & remap the logical system to the physical system Requirements Adaptability Generality Minimum overhead Stability Scalability Transparency Fault-tolerance Heterogeneity
151
Resource Sharing Extensively studied by DS & DAI
Load sharing algorithms provide an example of the cooperation mechanism required when using the mutual interest relation Components Locating a remote resource, information propagation, request acceptance, & process transfer policies Decision is based on weakly consistent information which may be inaccurate at times Adaptive algorithms adjust their behavior to the dynamic state of the system
152
Resource Sharing - Previous Study (1)
Performance of location policies with different complexity levels on load sharing algorithms Random selection Simplest Yields significant performance improvements in comparison with the no-cooperation case But considerable excess overhead is incurred by the remote execution attempts
153
Resource Sharing - Previous Study (2)
Threshold policy Probe a limited number of nodes Terminate the probing as soon as a node is found with a queue length shorter than the threshold Substantial performance improvement Shortest policy Probe several nodes & then select the one having the shortest queue, from among those having queue lengths shorter than the threshold There is no added value in looking for the best solution rather than an adequate one Advanced algorithms may not entail a dramatic improvement in performance
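A minimal sketch of the Threshold location policy just described (the probe helpers and all constants are illustrative stand-ins, not from the chapter):

```c
#include <stdlib.h>

#define THRESHOLD   3    /* max acceptable queue length (assumed)      */
#define PROBE_LIMIT 5    /* max nodes to probe before giving up        */
#define NUM_NODES   32   /* cluster size (assumed)                     */

/* Hypothetical stubs standing in for real cluster probes. */
static int pick_random_node(void)      { return rand() % NUM_NODES; }
static int probe_queue_length(int n)   { (void)n; return rand() % 8; }

/* Threshold policy: probe up to PROBE_LIMIT randomly chosen nodes and
 * stop at the first whose run queue is shorter than THRESHOLD.
 * Returns a node id, or -1 meaning "execute locally". */
int threshold_locate(void)
{
    for (int i = 0; i < PROBE_LIMIT; i++) {
        int node = pick_random_node();
        if (probe_queue_length(node) < THRESHOLD)
            return node;   /* adequate, not necessarily the best */
    }
    return -1;             /* no suitable node found */
}
```

The Shortest policy would instead probe all candidates and keep the one with the shortest sub-threshold queue; the point above is that this extra effort buys little.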
154
Flexible Load Sharing Algorithm
A location policy: similar to the Threshold algorithm Uses local information which is possibly replicated at multiple nodes For scalability, FLS divides a system into small subsets which may overlap Does not attempt to produce the best possible solution, but instead offers an adequate one at a fraction of the cost Can be extended to other matching problems in DSs
155
Algorithm Analysis (1) Qualitative evaluation
Distributed resource sharing is preferred for fault-tolerance and low overhead purposes Information dissemination Use information of a system subset Decision making Reduce mean response time to resource access requests
156
Algorithm Analysis (2) Quantitative evaluation
Performance and efficiency tradeoff Memory requirement for algorithm constructs State dissemination cost in terms of the rate of resource sharing state messages exchanged per node Run-time cost measured as the fraction of time spent running the resource access software component Percent of remote resource accesses out of all resource access requests Stability System property measured by resource sharing hit-ratio Precondition for scalability
157
Resource Sharing Enhanced Locality
Extended FLS No message loss Non-negligible but constrained latencies for accessing any node from any other node Availability of unlimited resource capacity Selection of new resource providers to be included in the cache is not a costly operation and need not be constrained
158
State Metric Positive: surplus resource capacity
Negative: resource shortage Neutral: does not participate in resource sharing
159
Network-aware Resource Allocation
160
Considering Proximity for Improved Performance
Extensions to achieve enhanced locality by considering proximity Response Time of the Original and Extended Algorithms (cache size 5)
161
Estimate Proximity (Latency)
Use round-trip message Communication delay between two nodes Observation sequence period
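A plausible sketch of such a round-trip probe (POSIX sockets assumed; in practice several probes would be averaged over the observation period):

```c
#include <time.h>
#include <unistd.h>

/* Estimate one-way latency to a peer as half the round-trip time of a
 * minimal echo message over an already-connected socket `fd`.
 * The peer is assumed to echo the byte back immediately. */
double estimate_latency_us(int fd)
{
    struct timespec t0, t1;
    char ping = 'p', pong;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    write(fd, &ping, 1);                /* send probe     */
    read(fd, &pong, 1);                 /* wait for echo  */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double rtt_us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return rtt_us / 2.0;                /* one-way estimate */
}
```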
162
Estimate Performance Improvement
Percentage of Close Allocations:
System Size    Extended FLS (%)    Original FLS (%)
15             38                  49
20             40                  53
30             37                  51
40             39                  52
50             34                  49.66

Scalability Metric for the Even Load Case:
System Size 20, 30, 40, 50; metric values 1, 0.97, 0.93, 0.96, 0.96, 0.99

Performance Improvement of Proximity Handling:
System Size    Uneven Load (%)    Even Load (%)
15             17.99              12.36
20             21.33              16
30             19.76              21.67
40             19.15              21
50             19.45              18.55
163
Prototype Implementation and Extension
PVM resource manager Default policy is round-robin Ignore the load variations among different nodes Cannot distinguish between machines of different speed Apply FLS to PVM resource manager
164
Basic Benchmark on a System Composed of 5 and 9 Pentium Pro 200 Nodes (Each Node Produces 100 Processes)
165
Conclusions Enhance locality Factor influencing locality
Considering proximity Reuse of state information
166
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
167
Dependable Clustered Computing
Introduction Two Worlds Converge Dependable Concepts Cluster Architectures Detecting and Masking Faults Recovering from Faults The Practice of Dependable Clustered Computing
168
Introduction Two different areas of computing with slightly different motivations High performance (HP) High availability (HA) The clustering model can provide both HA & HP, and also manageability, scalability, & affordability To accomplish the dream, cluster software technologies still have to evolve & mature Clusters Poor man's answer to the parallel high performance computing quest Observed in the Gordon Bell prize Typically homogeneous, tightly coupled, nodes trust each other Compare with distributed systems As the number of h/w components rises, so does the probability of failure (Leslie Lamport) "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable" Increasing probability of fault occurrence for long-running applications
169
Failover In case of failure of a component in the primary system, the critical applications can be transferred to the secondary server, thus avoiding downtime and guaranteeing application’s availability Inevitably imply a temporary degradation of the service, but without total permanent loss Fault-tolerant servers, pioneered by Tandem
170
Two Worlds Converge Only fault tolerance can enable effective parallel computing by allowing application continuity Parallelism enables high-availability by providing redundancy
171
Dependable Parallel Computing (I)
Dependability Arose as an issue as soon as clusters started to be used by the HP community Two main reasons As systems scale up to hundreds of nodes, the likelihood of a node failure increases proportionally with the number of nodes caused not only by permanent faults (permanent damage), but mainly by transient ones (due to heating, electromagnetic interference, marginal design, transient SW faults, and power supply disturbances) Those faults that statistically don't cause node failures, but have other insidious effects the effect of faults on applications running in parallel systems a large percentage of faults do not cause nodes to crash, system panics, core dumps, or applications to hang in many cases, everything goes apparently fine, but in the end, the parallel application generates incorrect results
172
Dependable Parallel Computing (II)
Examples ASCI Red : 9000 Intel Pentium Pro (a 263 Gbytes memory machine) MTBF : 10 hours for permanent faults, 1-2 hours for transient faults (a node’s MTBF is 10 years) ORNL (MPP, not cluster) MTBF is 20 hours Carnegie Mellon University (400 nodes) MTBF is 14 hours
173
Mission/Business Critical Computing (I)
The demand for HA applications increases Electronic commerce, WWW, data warehouse, OLAP, etc. the cost of an outage is substantial Development of dependable computing Cold-backup one of the first protection mechanisms for critical applications offline data backup Hot-backup Protect data, not just at a certain time of the day, but on a continuous basis using a mirrored storage system where all data is replicated RAID was introduced to protect data integrity Advanced H/W redundancy systems Redundancy at all system levels to eliminate single points of failure Fault tolerant, 99.999% availability (5 mins downtime per year)
174
Mission/Business Critical Computing (II)
Combine highly reliable & highly scalable computing solutions with open & affordable components Can be done with clusters The ability to scale to meet the demands of service growth became a requirement High performance joined high-availability as an additional requirement for mission/business critical systems Clusters intrinsically have the ability to provide HP & scalability High availability scalable clusters Today's mission/business critical clusters are a combination of the following capabilities Availability: one system can fail over to another system Scalability: increase the number of nodes to handle load increases Performance: perform workload distribution Manageability: manage the cluster as a single system These capabilities are delivered by clusterware
175
Dependability Concepts (I)
Faults, Errors, Failures Error detection latency The time between the first occurrence of the error and the moment when it is detected The longer the latency, the more affected the internal state of the system may become Fault The phenomenological cause of an error Transient Re-execution can compensate for a transient fault Permanent More complex to handle Software faults (bugs) Bohrbugs – the bug is always there again when the program is re-executed Heisenbugs – in a re-execution, the timing will usually be slightly different, and the bug does not happen again
176
Dependability Concepts (II)
Dependability Attributes Reliability Probability of continuous correct operation Availability A measure of the probability that the system will be providing good service at a given point in time MTBF The Mean Time Between Failures MTTR The Mean Time To Repair a failure, & bring the system back to correct service Availability can be expressed as MTBF/(MTBF+MTTR)
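A quick worked example of this formula with illustrative numbers (not from the slides):

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF}+\mathrm{MTTR}}
  = \frac{10000\,\mathrm{h}}{10000\,\mathrm{h}+1\,\mathrm{h}} \approx 0.9999
```

i.e. roughly 53 minutes of downtime per year; the 99.999% figure quoted earlier corresponds to about 5 minutes per year.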
177
Dependability Concepts (III)
Dependability Means Two ways Fault prevention Preventing faults from occurring Accomplished through conservative design, formal proofs, extensive testing, etc Fault tolerance Prevent errors from becoming failures The 2 approaches are complementary Disaster Tolerance Techniques aimed at handling a very particular type of fault, namely site failures caused by fire, floods, sabotage, etc Usually implies having redundant hardware at a remote site
178
Cluster Architectures (I)
Shared-Nothing versus Shared-Storage Shared-nothing cluster model Each node has its own memory & its own storage resources Allows nodes to access common devices or resources, as long as these resources are owned and managed by a single system at a time Avoids the complexity of cache coherency schemes & distributed lock managers May eliminate single points of failure within the cluster storage Scalable, because of the lack of contention & overhead Several strategies aimed at reducing bandwidth overhead Shared-storage Both servers share the same storage resource, with the consequent need for synchronization of disk accesses to keep data integrity Distributed lock managers must be used Scalability becomes compromised
179
Shared-nothing Clusters
180
Cluster Architectures (II)
Active/Standby versus Active/Active Active/Standby Called hot backup Primary server where the critical application runs Secondary server (hot spare) normally in standby mode can be a lower performance machine Active/Active All servers are active & do not sit idle waiting for a fault to occur & take over Providing bidirectional failover The price to performance ratio decreases N-Way Several active servers that back up one another
181
Active/Standby Clusters
182
N-Way Clusters
183
Cluster Architectures (III)
Interconnects 3 main kinds of interconnects – network, storage, monitoring Vital for clusters Network interconnect Communication channels that provide the network connection between the client systems & the cluster Generally Ethernet or Fast Ethernet Storage Provides the I/O connection between the cluster nodes & disk storage SCSI and Fibre Channel SCSI is more popular, limited in length (SCSI extender technologies) FC-AL permits fast connections - 100 MB/s - & up to 10 km Monitoring Additional comm media used for monitoring purposes (heartbeat transfer) NCR LifeKeeper uses both the network & storage interconnects for monitoring and also has an additional serial link between the servers for that purpose
184
Cluster Interconnects
185
Cluster Architectures (IV)
VIA Virtual Interface (VI) Architecture An open industry specification promoted by Intel, Compaq, & Microsoft Aim at defining the interface for high performance interconnection of servers and storage devices within a cluster (SAN) VI proposes to standardize the network (SAN) Define a thin, fast interface that connects software applications directly to the networking hardware while retaining the security & protection of the OS VI-SAN Specially optimized for high bandwidth & low latency comm
186
Cluster Architectures (V)
Storage Area Network & Fibre Channel SAN To eliminate the bandwidth bottlenecks & scalability limitations imposed by previous storage (mainly SCSI bus-based) architectures SCSI is based on the concept of a single host transferring data to a single LUN (logical unit) in a serial manner FC-AL Emerged as the high-speed, serial technology of choice for server-storage connectivity The most widely endorsed open standard for the SAN environment High bandwidth & scalability, support of multiple protocols (SCSI, IP) over a single physical connection Modular scalability One of the most important advantages of the SAN approach A key enabling infrastructure for long-term growth & manageability
187
Detecting and Masking Faults (I)
A dependable system is built upon several layers EDM (Error Detection Mechanism) Classified according to 2 important characteristics Error coverage The percentage of errors that are detected Detection latency The time an error takes to be detected Diagnosis layer Collect as much info as possible about the location, source & affected parts Run diagnosis over the failed system to determine whether the fault is transient or permanent Recovery & reconfiguration
188
Detecting and Masking Faults (II)
Self-Testing To detect errors generated by permanent faults Consists of executing specific programs that exercise different parts of the system, & validating the output against the expected known results Not very effective in detecting transient faults Processor, Memory, and Buses Parity bits - primary memories Parity checking circuitry – system buses EDAC (error detection & correction) chips common in workstation hardware Built-in error detection mechanisms inside microprocessors These fail to detect problems that occur at a higher level or affect data in subtle ways
189
Detecting and Masking Faults (III)
Watchdog Hardware Timers Provide a simple way of keeping track of proper process functioning If the timer is not reset before it expires, the process has probably failed in some way Provide an indication of possible process failure Coverage is limited: data & results are not checked Implemented in SW or HW Loosing the Software Watchdog Software watchdog A process that monitors other process(es) for errors In the simplest case it watches its application until the application eventually crashes, and the only action it takes is to relaunch the application The monitored process is instrumented to cooperate with the watchdog WinFT A typical example of a SW watchdog
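A minimal software watchdog of this relaunch-on-crash kind (a POSIX sketch; the monitored program path is a placeholder):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Simplest software watchdog: launch the application, wait for it to
 * terminate, and relaunch it whenever it exits abnormally. */
int main(void)
{
    const char *app = "./monitored_app";   /* placeholder path */

    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            execl(app, app, (char *)NULL); /* become the application */
            _exit(127);                    /* exec failed            */
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            break;                         /* clean exit: stop       */
        fprintf(stderr, "watchdog: restarting %s\n", app);
        sleep(1);                          /* simple restart backoff */
    }
    return 0;
}
```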
190
Detecting and Masking Faults (IV)
Heartbeats A periodic notification sent by the application to the watchdog to assert its aliveness Consist of an application-initiated "I'm Alive" message, or a request-response scheme (the watchdog requests the application to assert its aliveness through an "Are you Alive?" message & waits for acknowledgement) Both can coexist in the same system Useless if the OS crashes Clustering solutions provide several alternate heartbeat paths TCP/IP, RS-232, a shared SCSI bus Idle notification Informs the watchdog about periods when the application is idle or not doing any useful work, so that the watchdog can perform preventive actions (restarting the application) Software Rejuvenation As software gets older, the probability of faults is higher Error notification An application that is encountering errors which it can't overcome can signal the problem to the watchdog & request recovery actions A restart may be enough
191
Detecting and Masking Faults (V)
Assertions, Consistency Checking, and ABFT Consistency checking Simple fault detection technique that can be implemented both at the hardware and at the programming level Performed by verifying that the intermediate or final results of some operation are reasonable At the hardware level, built-in consistency checks for checking addresses, opcodes, and arithmetic operations Range check Confirm that a computed value is within a valid range Algorithm-Based Fault Tolerance (ABFT) After executing an algorithm, the application runs a consistency check specific for that algorithm Checksum for matrix operations Extend the matrices with additional columns
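To illustrate the matrix-checksum idea (a sketch, not the chapter's code): each row is extended with a column holding its sum, which a linear operation such as matrix addition preserves, so recomputing it afterwards detects a corrupted element.

```c
#include <math.h>
#include <stdbool.h>

#define N 4

/* ABFT-style check: column N of each row holds the row checksum
 * computed before the operation.  A linear operation preserves it,
 * so a mismatch afterwards signals a fault in that row. */
bool abft_row_checksums_ok(const double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[i][j];
        if (fabs(sum - a[i][N]) > 1e-9)
            return false;           /* inconsistent row i */
    }
    return true;
}
```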
192
Recovering from Faults (I)
Checkpointing and Rollback Checkpointing Allow a process to save its state at regular intervals during normal execution so that it can be restored later after a failure to reduce the amount of lost work When a fault occurs, the affected process can simply be restarted from the last saved checkpoint (state) rather than from the beginning Protect long-running applications against transient faults Suitable to recover applications from software errors Checkpointing classifications Transparency (data to include in the checkpoint) Checkpoints can be transparently or automatically inserted at runtime (or by the compiler) Save large amounts of data Inserted manually by the application programmer The checkpointing interval The time interval between consecutive checkpoints Critical for automatic checkpointing
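A toy sketch of manual (programmer-inserted) checkpointing, where the entire application state is a single loop counter (file name and interval are arbitrary):

```c
#include <stdio.h>

#define CKPT_FILE     "app.ckpt"
#define CKPT_INTERVAL 1000

/* Restore the saved counter, or start from 0 if no checkpoint exists. */
static long restore(void)
{
    long i = 0;
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f) { fread(&i, sizeof i, 1, f); fclose(f); }
    return i;
}

/* Save the counter; on a later restart the process resumes from here
 * instead of from the beginning. */
static void checkpoint(long i)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (f) { fwrite(&i, sizeof i, 1, f); fclose(f); }
}

int main(void)
{
    for (long i = restore(); i < 1000000; i++) {
        /* ... one unit of real work would go here ... */
        if (i % CKPT_INTERVAL == 0)
            checkpoint(i);
    }
    return 0;
}
```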
193
Recovering from Faults (II)
Transactions A group of operations which form a consistent transformation of state operations :- DB updates, messages, or external actions ACID properties Atomicity Either all or none of the actions should happen commit or abort Consistency Each transformation should see a correct picture of the state, even if concurrent transactions are updating the state Integrity The transaction should be a correct state transformation Durability Once a transaction commits, all its effects must be preserved, even if there is a failure
194
Recovering from Faults (III)
Failover and Failback Failover process The core of the cluster recovery model A situation in which the failure of one node causes a switch to an alternative or backup node Should be totally automatic and transparent, without the need for administrator intervention or client manual reconnection Require a number of resources to be switched over to the new system (switch network identity) Takeover Used in fault-tolerant systems Built with multiple components running in lock-step Failback Move back the applications/clients to the original server once it is repaired Should be automatic & transparent Used efficiently for another purpose: maintenance
195
The Practice of Dependable Clustered Computing (I)
MS Cluster Server Wolfpack Provides HA over legacy Windows applications, as well as tools & mechanisms to create cluster-aware applications Inherited much of the experience of Tandem in HA Tandem provided the NonStop SW & the ServerNet from Himalaya Servers A phased approach Phase I :- support 2 node clusters & failover of one node to another Phase II :- a multinode share-nothing architecture scalable up to 16 nodes Not lock-step/fault-tolerant Not support the "moving" of running applications Not the highest availability class (a max of 3-4 sec of downtime per year), but seconds of downtime per year Structured approach to the cluster paradigm using a set of abstractions Resource :- a program or device managed by a cluster Resource Group :- a collection of related resources Cluster :- a collection of nodes, resources, and groups Resource The basic building block of the architecture 3 states Offline (exists but is not offering service) Online (exists and offers service) Failed (not able to offer service)
196
The Practice of Dependable Clustered Computing (II)
NCR LifeKeeper In 1980 NCR began delivering failover products to the telecomm industry LifeKeeper appeared in the Unix world at the beginning of the 90s Recently ported to Windows NT Multiple fault detection mechanisms Heartbeat via SCSI, LAN, & RS-232, event monitoring, & log monitoring 3 different configurations Active/active, active/standby, & N-way (with N=3) Recovery action SW kits for core system components Provides a single point of administration & remote administration capabilities
197
The Practice of Dependable Clustered Computing (III)
Oracle Failsafe and Parallel Server Oracle DB options for clustered systems The archetype of a cluster solution specially designed for DB systems Oracle Failsafe Based on MSCS Support a max of 2 nodes (due to limitations of MSCS phase I) Targeted for workgroup and department servers Oracle Parallel Server Support a single DB running across multiple nodes in a cluster Not require MSCS Support more than 2 nodes Targeted for enterprise level systems, seeking not only HA but also scalability Support Digital Clusters for Windows NT & IBM’s “Phoenix” technology Oracle servers Support hot-backups, replication, & mirrored solutions
198
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
199
Deploying a High Throughput Computing Cluster
Introduction Condor Overview Software Development System Administration Summary
200
Introduction (1) HTC(High Throughput Computing) environment
To provide large amounts of processing capacity to customers over long periods of time by exploiting existing computing resources on the network To maximize processing capacity Must utilize heterogeneous resources Required a portable solution that encapsulates the differences between resources Include resource management framework To provide capacity over long periods of time The environment must be reliable and maintainable Surviving software and hardware failure Allowing resources to join and leave the cluster easily Enabling system upgrade and reconfiguration without significant downtime
201
Introduction (2) HTC software must be robust and feature-rich to meet the needs of resource owners, customers, and system administrators Resource owners Must be satisfied that their rights will be respected and the policies they specify will be enforced if they are to donate the use of their resources to the customers Customers Will use the HTC environment to run their applications only if the cost of learning the complexities of the HTC system does not outweigh the benefit of additional processing capacity System administrators Will install and maintain the system only if it provides a tangible benefit to its users which outweighs the cost of maintaining it
202
Introduction (3) Resources in the HTC cluster may be distributively owned Control over powerful computing resources in the cluster is distributed among many individuals and small groups The willingness to share a resource with the HTC environment may vary for each resource owner The cluster may include some resources which are dedicated to HTC, others which are unavailable for HTC during certain hours or when the resource is otherwise in use, and still others which are available to specific HTC customers and applications Even when resources are available for HTC, the application may be allowed only limited access to the components of the resource and may be preempted at any time Distributed ownership often results in decentralized maintenance, where resource owners maintain and configure each resource for a specific use This adds an additional degree of resource heterogeneity to the cluster
203
Introduction (4) Describes some of the challenges faced by software developers and system administrators when deploying an HTC cluster, and some of the approaches for meeting those challenges Usability Flexibility Reliability Maintainability
204
Condor Overview Condor example Customer agent Resource agent
Represent customer Manages a queue of application descriptions Sends resource requests to the matchmaker Resource agent Represent resource Implements the policies of the resource owner Sends resource offers to the matchmaker Matchmaker Responsible for finding matches between resource requests and resource offers Notifying the agents when a match is found
205
Condor Resource Management Architecture
206
Condor overview (Cont’d)
Condor example (cont’d) Upon notification, the customer agent and the resource agent perform a claiming protocol to initiate the allocation Resource requests and offers contain constraints which specify if a match is acceptable Matchmaker implements system wide policies by imposing its own set of constraints on matches
207
Software Development Developer of HTC system must overcome four primary challenges Utilization of heterogeneous resources Requires system portability Portability obtained through a layered system design Evolution of network protocols Required for a system where resources and customer needs are constantly changing Requiring deployment of new features in the HTC system Remote file access mechanism Guarantees that an application will be able to access its data files from any workstation in the cluster Utilization of non-dedicated resources Requires the ability to preempt and resume an application using a checkpoint mechanism
208
Layered Software Architecture (1)
HTC system relies on the host OS to provide network communication, process management, workstation statistics Since the interface to the services differs on each OS, the portability of the HTC system will benefit from a layered software architecture The system is written to a system independent API, reducing the cost of porting to a new architecture
209
Layered Software Architecture (2)
210
Layered Software Architecture (3)
Network API Provide connection-oriented and connectionless, reliable and unreliable interfaces, with many mechanisms for authentication and encryption Perform all conversions between host and network integer byte order automatically Check for overflows Provide standard host address lookup mechanisms
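For instance, the byte-order conversions such an API would hide look like this (standard POSIX functions; the wrapper names are illustrative):

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Illustrative wrappers: marshal a 32-bit value into network byte
 * order before it goes on the wire, so peers with different host
 * byte orders (e.g. x86 vs. SPARC) agree on the representation. */
uint32_t marshal_u32(uint32_t host_value)
{
    return htonl(host_value);   /* host-to-network, 32-bit */
}

uint32_t unmarshal_u32(uint32_t wire_value)
{
    return ntohl(wire_value);   /* network-to-host, 32-bit */
}
```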
211
Layered Software Architecture (4)
Process management API Provides the ability to create, suspend, unsuspend and kill a process to enable the HTC system to control the execution of a customer’s application Parent process may pass state to new child processes, including network connections, open files, environment variables, security attributes Child process may not run an instance of the same API
212
Layered Software Architecture (5)
Workstation statistics API Reports the information necessary to implement the resource owner policies and verify that the customer application requirements are met Standard libraries Advantage Save time in developing the HTC system Better tested Disadvantage May be a poor fit for a specific HTC system May not be available for all platforms May work incorrectly on some platforms
213
Layered Resource Management Architecture (1)
Resource management architecture of the HTC environment also benefits from a layered system design framework Layered system design in Condor Yields a modular system design The interface to each system layer is defined in the resource management model The interface allows the implementation of each layer to change so long as the interface continues to be supported Separates the advertising, matchmaking, and claiming protocols
214
Layered Resource Management Architecture (2)
215
Protocol Flexibility As the distributed system evolves to provide new and improved services, the network protocols will be affected Protocols are augmented to transfer additional information This requires all components of the distributed system to be updated to recognize the additional information In a large HTC system, it is often inconvenient to update all components at one time To support this evolution, the HTC network protocols may utilize a general purpose data format which allows more flexibility Example : the Condor protocol is similar to an RPC protocol A leading integer specifies the action, and a named parameter list is associated with the action This provides backward compatibility, since older agents will ignore the new parameters and new agents are written to accept packets with or without the new parameters
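A sketch of the general idea (the encoding and field names are illustrative, not Condor's actual wire format): a leading action code plus free-form name=value pairs that older agents can simply skip.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative flexible message: an action code plus name=value
 * pairs.  Agents ignore names they do not recognize, so new
 * parameters can be added without breaking old agents. */
struct message {
    int         action;      /* leading integer: what to do      */
    const char *params;      /* e.g. "host=n1;port=9618;newp=x"  */
};

/* Look up one named parameter; returns 1 and copies the value if
 * found, 0 otherwise (the caller falls back to a default).
 * A real parser would also check token boundaries. */
int get_param(const struct message *m, const char *name,
              char *out, size_t outlen)
{
    char key[64];
    snprintf(key, sizeof key, "%s=", name);
    const char *p = strstr(m->params, key);
    if (!p)
        return 0;
    p += strlen(key);
    size_t n = strcspn(p, ";");       /* value ends at ';' or end */
    if (n >= outlen) n = outlen - 1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}
```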
216
Example of Protocol Data Format
217
Remote File Access (1) A remote file access mechanism guarantees that an HTC application will have access to its data files from any workstation in the cluster To use an existing distributed file system The HTC environment must authenticate the customer's application to that file system NFS authenticates via user ID AFS and NTFS authenticate via Kerberos and NT server credentials To run the customer's application, the HTC environment may require administrator privileges on the remote workstation, or the ability to transparently forward credentials (using the customer's password) The customer may be required to grant file access permission to the HTC system before submitting the application for execution
218
Remote File Access (2) To implement data file staging
The system transfers input files to the local disk of the remote workstation before running the application The system is responsible for gathering up the application's output files and transferring them to a destination specified by the customer Requires free disk space on the remote workstation and the bandwidth to transfer the data files at the start and end of each allocation For large data files, this results in high start-up and tear-down costs compared to a block file access mechanism provided by a distributed file system or redirected file I/O system calls
219
Remote File Access (3) Redirecting file I/O system calls
HTC environment must interpose itself between the application and the OS and service file I/O system calls itself Accomplished by linking the application with an interposition library or by trapping system calls through an OS interface Has the significant benefit that it places no file system requirement on the remote workstation Enables the HTC environment to utilize a greater number of resources The drawback is that developing and maintaining a portable interposition system can be very difficult, since different OSs provide different interposition techniques and the system call interface differs on each OS and often changes with each new OS release
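On Linux/glibc, one common interposition technique is an LD_PRELOAD library; the sketch below illustrates the general mechanism (it is not Condor's implementation) by intercepting open():

```c
/* Build: gcc -shared -fPIC -o interpose.so interpose.c -ldl
 * Run:   LD_PRELOAD=./interpose.so ./app
 * Intercepts open() so file access could be redirected to a
 * remote-access layer; here it only logs and passes through. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdarg.h>
#include <fcntl.h>

int open(const char *path, int flags, ...)
{
    /* Locate the real open() the application would have called. */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {          /* mode is only valid with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    fprintf(stderr, "interposed open(%s)\n", path);
    return real_open(path, flags, mode);
}
```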
220
System Call Interposition
221
Checkpointing A checkpoint of an executing program is a snapshot of its state which can be used to restart the program from that state at a later time Provides reliability When a compute node fails, the program running on that node can be restarted from its most recent checkpoint Enables preemptive-resume scheduling All parties involved in an allocation can break the allocation at any time without losing the work already accomplished by simply checkpointing the application Most workstation OSs do not provide a kernel-level checkpointing service The HTC environment must often rely on user-level checkpointing Developing portable, user-level checkpointing is a significant challenge, since OS APIs for querying and setting process state vary and are often incomplete
222
System Administration
Administrators of an HTC environment answer to resource owners, customers, and policy makers They have a responsibility to guarantee that the HTC environment enforces the access policies of resource owners Since resources in a cluster are often heterogeneous and distributively owned, these policies are often complicated and vary from resource to resource They have a responsibility for ensuring that customers receive valuable service from the HTC environment This involves working with customers to understand the needs and development of their applications, detect application failures, investigate the causes of the failures, and develop solutions to avoid future failures They must demonstrate to policy makers that the HTC environment is meeting stated goals Requires accounting of system usage and availability
223
Access Policies A resource access policy specifies who may use a resource, how they may use it, and when they may use it The administrator determines an access policy in consultation with the resource owner and implements it Defines a set of expressions which specify when an application may begin using a resource and when and how an application must stop using a resource Requirements Rank (preference of resource owner) Suspend Continue Vacate Kill
224
Examples Requirements = (KeyboardIdle > 15 * Minute) && (LoadAvg < 0.3) Rank = (Customer == "<preferred customer>") ? 1 : 0 Suspend = (KeyboardIdle < Minute) Continue = (KeyboardIdle > 2 * Minute) Vacate = (SuspendTime > 5 * Minute) Kill = (VacateTime > 5 * Minute) Vacate = (SuspendTime > 5 * Minute) && (JobImageSize < 100*MB) Kill = (JobImageSize < 100*MB) ? (VacateTime > 5 * Minute) : (SuspendTime > 5 * Minute)
225
Reliability (1) Reliability is a primary concern in an HTC environment
Distributed file systems can be a frequent cause of system failure A file system failure may cause a process to crash due to page faults HTC software must react when configuration files and log files are temporarily inaccessible The HTC system is responsible for enforcing the policies of the resource owner The system processes must not fail and leave a running application unattended A master process Monitor the other HTC processes to detect failures and invoke recovery mechanisms Cache executable files on the local disk Serve as an administrative module (gathering statistics) Report currently running services Allow the administrator to start and stop services Detect and react to configuration changes and system upgrades
226
Reliability (2) The HTC environment must be prepared for failures and must automate recovery for common failures HTC software must be able to distinguish between normal and abnormal termination of an application Abnormal termination may be due to an environmental failure or an unrecoverable application error One approach is to monitor the system calls to detect the source of failure The system could also defer to the customer, asking the customer to alert the system when an application terminates abnormally
227
Problem Diagnosis via System Logs
System log files are the primary tools for diagnosing system failures Using log files, the administrator should be able to reconstruct the events leading up to the failure Need to manage the amount of historical log information Managing distributed log files Store logs centrally on a file server or a customized log server Or provide a single interface to the distributed log files by installing logging agents on each workstation which respond to log queries made by a client application
228
HTC Environment Logs Application log Customer log Resource log
System call trace Checkpointing information and statistics Remote I/O trace with statistics Errors occurring during the allocation Customer log Allocation information and statistics Application arrival and termination Matchmaking and claiming errors Resource log Policy action trace Master log HTC agent (re-)starts Administrative commands Agent upgrades Scheduling log Record of all matches Allocation history (accounting) Security log Record of all rejected requests Record of all authenticated actions
229
Monitoring and Accounting
HTC environment provides system monitoring and accounting facilities to the administrator Allow the administrator to assess the current and historical state of the system and to track system usage
230
Monitoring Resource Usage
231
Monitoring HTC Applications
232
Security (1) HTC environment is potentially vulnerable to both resource and customer attacks A resource attack Occurs when an unauthorized user gains access to a resource via the HTC environment Or when an authorized user violates the resource owner's access policy A customer attack Occurs when the customer's account or data files are compromised via the HTC environment
233
Security (2) Protecting the resource from unauthorized access requires an effective user authentication mechanism Explicit authorized user list Requirement = (Customer == "<user1>") || (Customer == "<user2>")
234
Security (3) The HTC system must ensure the validity of the customer attribute The resource agent can verify the customer attribute by requesting that the customer agent digitally sign the resource request with the customer's private key Protecting against violations of the resource owner's access policy Requires that the resource agent maintain control over the application and monitor its activity The resource agent uses the OS API to set resource limitations May run the application under a "guest" account that provides only limited access to the workstation Intercept the system calls performed by the application via the OS interposition interface
235
Security (4) To protect against attackers accessing the customer's files, the HTC environment must ensure that all resource agents are trustworthy An untrustworthy resource agent can potentially mount a customer attack The application must have access to its data files via a remote file access mechanism Resource agents may be authenticated cryptographically The cluster can be restricted to include only resource agents on trusted hosts (authenticated via IP address) Unencrypted streams over the network are vulnerable to snooping HTC agents are potentially vulnerable to the common buffer-overflow attack
236
Remote Customers (1) It can be more convenient for both customers and administrators to provide remote access to the HTC cluster instead of direct access The customer installs a customer agent on his or her workstation, and the administrator allows that agent access to the HTC cluster The administrator is no longer required to create a workstation account for the customer, but instead must only create an HTC account Remote customers may require special consideration when configuring the HTC environment They may not be considered as trustworthy as local customer agents, and so additional security precautions may be required
237
Remote Customers (2) Customer agent may connect to the cluster over a wide area network Limited bandwidth Decreased reliability Additional security No required data file transfer Caching in the remote file access
238
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
239
Lightweight Messaging Systems
Introduction Latency/Bandwidth Evaluation of Communication Performance Traditional Communication Mechanisms for Clusters Lightweight Communication Mechanisms Kernel-Level Lightweight Communications User-Level Lightweight Communications A Comparison Among Message Passing Systems
240
Introduction The communication mechanism is one of the most important parts of a cluster system As PCs and workstations become more powerful and fast network hardware becomes more affordable Existing communication software needs to be revisited in order not to become a severe bottleneck for cluster communications In this chapter A picture of the state-of-the-art in the field of clusterwide communications Classifying existing prototypes Message-passing communication NOWs and clusters are distributed-memory architectures These distributed-memory architectures are based on message-passing communication systems
241
Latency/Bandwidth Evaluation of Communication Performance
Major performance measurements The performance of communication systems is mostly measured by the two parameters below Latency, L deals with the synchronization semantics of a message exchange Asymptotic bandwidth, B deals with the (large, intensive) data transfer semantics of a message exchange
242
Latency Purpose Definition Measure the latency, L
Characterize the speed of the underlying system in synchronizing two cooperating processes by a message exchange Definition Time needed to send a minimal-size message from a sender to a receiver From the instant the sender starts a send operation To the instant the receiver is notified about the message arrival Sender and receiver are application-level processes Measure the latency, L Use a ping-pong microbenchmark L is computed as half the average round-trip time (RTT) Discard the first few measurements to exclude the "warm-up" effect
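A bare-bones ping-pong microbenchmark following this recipe (a POSIX sketch over a connected socket whose peer echoes each byte back; the iteration counts are arbitrary):

```c
#include <time.h>
#include <unistd.h>

#define WARMUP 100      /* discarded iterations ("warm-up")   */
#define ROUNDS 10000    /* measured ping-pong iterations      */

/* Measure latency as half the mean round-trip time of a 1-byte
 * message on a connected socket `fd` whose peer echoes it back. */
double pingpong_latency_us(int fd)
{
    char b = 0;
    struct timespec t0, t1;

    for (int i = 0; i < WARMUP; i++) {      /* exclude warm-up */
        write(fd, &b, 1);
        read(fd, &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        write(fd, &b, 1);                   /* ping */
        read(fd, &b, 1);                    /* pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6
                    + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return total_us / ROUNDS / 2.0;         /* half the mean RTT */
}
```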
243
End-to-end and One-sided Asymptotic Bandwidth (I)
Purpose Characterizes how fast a data transfer may occur from a sender to a receiver "Asymptotic": the transfer speed is measured for a very large amount of data One, bulk, or stream Definition of asymptotic bandwidth, B B = S/D D is the time needed to send S bytes of data from a sender to a receiver S must be very large in order to isolate the data transfer from any other overhead related to the synchronization semantics
244
End-to-end and One-sided Asymptotic Bandwidth (II)
Measure the asymptotic bandwidth, B End-to-end Use a ping-pong microbenchmark to measure the average round-trip time D is computed as half the average round-trip time This measures the transfer rate of the whole end-to-end communication path One-sided Use a ping microbenchmark to measure the average send time This measures the transfer rate as perceived by the sender side of the communication path, thus hiding the overhead at the receiver side D is computed as the average data transfer time (not divided by 2) The one-sided value is greater than the end-to-end one
245
Throughput Message Delay, D(S) Definition of throughput, T(S)
D(S) = L + (S - Sm)/B Sm is the minimal message size allowed by the system D(S) is half the round-trip time (ping-pong) or the data transfer time (ping) Definition of throughput, T(S) T(S) = S/D(S) It is worth noting that the asymptotic bandwidth is nothing but the throughput for a very large message A partial view of the entire throughput curve is given by the half-performance message size Sh, defined by T(Sh) = B/2
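Solving T(Sh) = B/2 with the delay model above (neglecting Sm for simplicity) gives a compact expression for the half-performance message size:

```latex
\frac{S_h}{L + S_h/B} = \frac{B}{2}
\;\Longrightarrow\; 2 S_h = L B + S_h
\;\Longrightarrow\; S_h = L \cdot B
```

For example, with L = 30 µs and B = 12 MB/s, Sh is about 360 bytes.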
246
Traditional Communication Mechanisms for Clusters
Interconnection of standard components They focus on standardization for interoperation and portability rather than efficient use of resources TCP/IP, UDP/IP and Sockets RPC MPI and PVM Active Message
247
TCP, UDP, IP, and Sockets The most standard communication protocols
Internet Protocol (IP) provides unreliable delivery of single packets to one-hop distant hosts implements two basic kinds of QoS: connected (TCP/IP) and datagram (UDP/IP) Berkeley Sockets Both TCP/IP and UDP/IP were made available to the application level through an API, namely Berkeley Sockets The network is perceived as a character device, and sockets are file descriptors related to the device Its level of abstraction is quite low
248
RPC Remote Procedure Call, RPC Familiarity and generality
Enhanced general purpose network abstraction atop sockets (especially for distributed client-server applications) The de facto standard for distributed client-server applications Its level of abstraction is high Familiarity and generality sequential-like programming Services are requested by calling procedures with suitable parameters; the called service may also return a result Hiding any format difference It hides any format difference across different systems connected to the network in a heterogeneous environment
249
MPI and PVM General-purpose systems Parallel Virtual Machine (PVM)
the general-purpose systems for message passing and parallel program management on distributed platforms at the application level, based on available IPC mechanisms Parallel Virtual Machine (PVM) provides an easy-to-use programming interface for process creation and IPC, plus a run-time system for elementary application management run-time programmable but inefficient Message Passing Interface (MPI) offers a larger and more versatile set of routines than PVM, but does not offer a run-time management system greater efficiency compared to PVM
250
Active Message (I) One-sided communication paradigm Reducing overhead
Whenever the sender process transmits a message, the message exchange occurs regardless of the current activity of the receiver process Reducing overhead The goal is to reduce the impact of communication overhead on application performance Active Message Eliminates the need for much of the temporary storage for messages along the communication path With proper hardware support, it is easy to overlap communication with computation As soon as it is delivered, each message triggers a user-programmed function of the destination process, called the receiver handler The receiver handler acts as a separate thread consuming the message, thereby decoupling message management from the current activity of the main thread of the destination process
251
Active Message Architecture
AM-II API Virtual Network Firmware Hardware
252
Active Message Communication Model
253
Conceptual Depiction of Send-Receive Protocol for Small Messages
254
Conceptual Depiction of Send-Receive Protocol for Large Messages
255
Active Message (II) AM-II API Support three types of message
short, medium, bulk Return-to-sender error model all undeliverable messages are returned to their senders applications can register per-endpoint error handlers
256
Active Message (III) Virtual networks
An abstract view of collections of endpoints as virtualized interconnects Direct network access an endpoint is mapped into a process’s address space Protection use standard virtual memory mechanism On-demand binding of endpoint to physical resources NI memory: endpoint cache – active endpoint host memory: less active endpoint endpoint fault handler
257
Active Message (IV) Firmware Endpoint scheduling Flow control
weighted round-robin policy skip empty queues 2k attempts to send, where k=8 Flow control fill the communication pipe between sender and receiver prevent receiver buffer overrun three-level flow control user-level credit-based flow control NIC-level stop-and-wait flow control link-level back pressure channel management tables channel-based flow control timer management timeout and retransmission error handling detect duplicated or dropped messages via sequence number & timestamp detect unreachable endpoints via timeout & retransmission detect & correct other errors via user-level error handler
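A sketch of the user-level credit-based scheme in this list (simplified; net_send and the credit count are illustrative): the sender spends one credit per message and stalls at zero, and credits return with acknowledgements, which is what prevents receiver buffer overrun.

```c
#define MAX_CREDITS 16   /* receiver buffer slots (assumed) */

static int credits = MAX_CREDITS;

/* Stand-in for the real low-level NIC send. */
static void net_send(const void *msg, int len) { (void)msg; (void)len; }

/* Returns 0 and sends if a credit is available; -1 means the
 * sender must wait (back pressure on a full pipe). */
int send_with_credit(const void *msg, int len)
{
    if (credits == 0)
        return -1;           /* pipe full: stall the sender   */
    credits--;
    net_send(msg, len);
    return 0;
}

/* Called when an acknowledgement arrives: the receiver freed a
 * buffer slot, so the sender regains one credit. */
void on_ack(void)
{
    credits++;
}
```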
258
Active Message (V) Performance
100 Ultra SPARC stations & 40 Myrinet switches 42 µs round trip time 31 MB/s bandwidth
259
Lightweight Communication Mechanisms
Lightweight protocols Cope with the lack of efficiency of standard communication protocols for cluster computing Linux TCP/IP is not good for cluster computing Performance test in a Fast Ethernet environment Pentium II 300 MHz, Linux kernel 2.0 2 PCs are connected by UTP with 3Com 3c905 Fast Ethernet NICs results latency = 77 µs (socket) vs. 7 µs (card) bandwidth large data stream: 86% short messages (<1500 bytes): less than 50% Drawbacks of layered protocols memory-to-memory copy poor code locality heavy functional overhead
260
Linux 2.0.29 TCP/IP Sockets: Half-Duplex “Ping-Pong” Throughput with Various NICs and CPUs
261
What We Need for Efficient Cluster Computing (I)
To implement an efficient messaging system Choose appropriate LAN hardware ex) the 3Com 3c905 NIC can be programmed in two ways In descriptor-based DMA (DBDMA), the NIC itself performs DMA transfers between host memory and the network using 'DMA descriptors' In CPU-driven DMA, this leads to a 'store-and-forward' behavior Tailor the protocols to the underlying LAN hardware ex) flow control of TCP TCP avoids packet overflow at the receiver side, but cannot prevent overflow from occurring in a LAN switch In cluster computing, overflow in a LAN switch is important
262
What We Need for Efficient Cluster Computing (II)
To implement an efficient messaging system Target the protocols to the user needs Different users and different application domains may need different tradeoffs between reliability and performance Optimize the protocol code and the NIC driver as much as possible Minimize the use of memory-to-memory copy operations ex) TCP/IP The layered structure of TCP/IP requires memory-to-memory data movements
263
Typical Techniques to Optimize Communication (I)
Using multiple networks in parallel Increases the aggregate communication bandwidth Cannot reduce latency Simplifying LAN-wide host naming Addressing conventions in a LAN might be simpler than in a WAN Simplifying the communication protocol Long protocol functions are time-consuming and have poor locality, which generates a large number of cache misses General-purpose networks have a high error rate, but LANs have a low error rate Optimistic protocols assume no communication errors and no congestion
264
Typical Techniques to Optimize Communication (II)
Avoiding temporary buffering of messages Zero-copy protocols remapping the kernel-level temporary buffers into user memory space lock the user data structures into physical RAM and let the NIC access them directly upon communication via DMA need gather/scatter facility Pipelined communication path Some NICs may transmit data over the physical medium while the host-to-NIC DMA or programmed I/O transfer is still in progress The performance improvement is obtained at both latency and throughput
265
Typical Techniques to Optimize Communication (III)
Avoid system calls for communication Invoking a system call is a time-consuming task A user-level communication architecture implements the communication system entirely at the user level all buffers and registers of the NIC are remapped from kernel space into user memory space protection challenges in a multitasking environment Lightweight system calls for communication Reduce the cost of system calls: save only a subset of CPU registers and do not invoke the scheduler upon return
266
Typical Techniques to Optimize Communication (IV)
Fast interrupt path In order to reduce interrupt latency in interrupt-driven receives, the code path to the interrupt handler of the network device driver is optimized Polling the network device The usual method of notifying message arrivals by interrupts is time-consuming and sometimes unacceptable Provides the ability of explicitly inspecting or polling the network devices for incoming messages, besides interrupt-based arrival notification Providing very low-level mechanisms A kind of RISC approach Provide only very low-level primitives that can be combined in various ways to form higher level communication semantics and APIs in an ‘ad hoc’ way
267
The Importance of Efficient Collective Communication
To turn the potential benefits of clusters into widespread use The development of parallel applications exhibiting high enough performance and efficiency with a reasonable programming effort Porting problem An MPI code is easily ported from one hardware platform to another But the performance and efficiency of the code execution are not ported across platforms Collective communication Collective routines often provide the most frequent and extreme instance of this "lack of performance portability" In most cases, collective communications are implemented in terms of point-to-point communications arranged into standard patterns This implies very poor performance with clusters As a result, parallel programs hardly ever rely on collective routines
268
A Classification of Lightweight Communication Systems (I)
kernel-level systems and user-level systems Kernel-level approach The messaging system is supported by the OS kernel with a set of low-level communication mechanisms embedding a communication protocol Such mechanisms are made available to the user level through a number of OS system calls Fit into the architecture of modern OS providing protected access A drawback is that traditional protection mechanisms may require quite a high software overhead for kernel-to-user data movement
269
A Classification of Lightweight Communication Systems (II)
User-level approach Improves performance by minimizing the OS involvement in the communication path Access to the communication buffers of the network interface is granted without invoking any system calls Any communication layer as well as the API is implemented as a user-level programming library To allow protected access to the communication devices single-user network access unacceptable in modern processing environments strict gang scheduling inefficient, requires an intervening OS scheduler Leverage programmable communication devices uncommon devices Additions or modifications to the OS are needed
270
Kernel-level Lightweight Communications
Industry-Standard API systems Beowulf Fast Sockets PARMA2 Best-Performance systems Genoa Active Message MAchine (GAMMA) Net* Oxford BSP Clusters U-Net on Fast Ethernet
271
Industry-Standard API Systems (I)
Portability and reuse The main goal besides efficiency is to comply with an industry standard for the low-level communication API Does not force any major modification to the existing OS; the new communication system is simply added as an extension of the OS itself Drawback Some optimizations in the underlying communication layer could be hampered by the choice of an industry standard
272
Industry-Standard API Systems (II)
Beowulf Linux-based cluster of PCs channel bonding two or more LANs in parallel topology two-dimensional mesh two Ethernet cards on each node are connected to horizontal and vertical line each node acts as a software router
273
Industry-Standard API Systems (III)
Fast Sockets implementation of TCP sockets atop an Active Message layer socket descriptors opened at fork time are shared with child processes poor performance: UltraSPARC 1 connected by Myrinet 57.8 µs latency due to Active Message 32.9 MB/s asymptotic bandwidth due to SBus bottleneck
274
Industry-Standard API Systems (IV)
PARMA2 To reduce communication overhead in a cluster of PCs running Linux connected by Fast Ethernet eliminate flow control and packet acknowledgement from TCP/IP simplify host addressing Retain BSD socket interface easy porting of applications (ex. MPICH) preserving the NIC driver Performance: Fast Ethernet and Pentium 133 74 µs latency, 6.6 MB/s (TCP/IP) 256 µs latency for MPI/PARMA (402 µs for MPI) 182 µs latency for MPIPR
275
Best-Performance Systems (I)
Simplified protocols designed according to a performance-oriented approach Genoa Active Message Machine (GAMMA) Active Message-like communication abstraction called Active Ports allows a zero-copy optimistic protocol Provides lightweight system calls, a fast interrupt path, and a pipelined communication path Multiuser protected access to the network Unreliable: raises an error condition without recovery Efficient performance (100base-T) 12.7 µs latency, 12.2 MB/s asymptotic bandwidth
276
Best-Performance Systems (II)
Net* A communication system for Fast Ethernet based upon a reliable protocol implemented at kernel level Remaps kernel-space buffers into user space to allow direct access Only a single user process per node can be granted network access Drawbacks no kernel-operated network multiplexing is performed user processes have to explicitly fragment and reassemble messages longer than the Ethernet MTU Very good performance 23.3 µs latency and 12.2 MB/s asymptotic bandwidth
277
Best-Performance Systems (III)
Oxford BSP Clusters Place some structural restrictions on communication traffic by allowing only some well-known patterns to occur good for optimizing error detection and recovery A parallel program running on a BSP cluster is assumed to comply with the BSP computational model Protocols of BSP clusters destination scheduling is different from processor to processor switched network using exchanged packets as acknowledgement packets BSPlib-NIC the most efficient version of the BSP cluster protocol has been implemented as a device driver called BSPlib-NIC remapping the kernel-level FIFOs of the NIC into user memory space to allow user-level access to the FIFOs no need for "start transmission" system calls along the whole end-to-end communication path Performance (100base-T) 29 µs latency, 11.7 MB/s asymptotic bandwidth
278
Best-Performance Systems (IV)
U-Net on Fast Ethernet Requires a NIC with a programmable onboard processor The drawback is the very raw programming interface Performance (100base-T) 30 µs one-way latency, 12.1 MB/s asymptotic bandwidth
279
U-Net User-level network interface for parallel and distributed computing Design goals Low latency, high bandwidth with small messages Emphasis on protocol design and integration flexibility Portable to off-the-shelf communication hardware Role of U-Net Multiplexing Protection Virtualization of the NI
280
Traditional Communication Architecture
281
U-Net Architecture
282
Building Blocks of U-Net
283
User-Level Lightweight Communications (I)
User-level approach Derived from the assumption that OS communications are inefficient by definition The OS involvement in the communication path is minimized Three solutions to guarantee protection Leverage programmable NICs support for device multiplexing Granting network access to one single trusted user not always acceptable Network gang scheduling exclusive access to the network interface
284
User-Level Lightweight Communications (II)
Basic Interface for Parallelism (BIP) Implemented atop a Myrinet network of Pentium PCs running Linux Provides both blocking and non-blocking communication primitives Send-receive paradigm implemented according to a rendezvous communication mode Policies for performance a simple error detection feature without recovery fragment any send message for pipelining get rid of protected multiplexing of the NIC the registers of the Myrinet adapter and the memory regions are fully exposed to user-level access Performance 4.3 µs latency, 126 MB/s bandwidth TCP/IP over BIP: 70 µs latency, 35 MB/s bandwidth MPI over BIP: 12 µs latency, 113.7 MB/s bandwidth
285
User-Level Lightweight Communications (III)
Fast Messages: an Active Message-like system running on Myrinet-connected clusters; reliable in-order delivery with flow control and retransmission; works only in single-user mode. Enhancements in FM 2.x: a programming interface for MPI; gather/scatter features; MPICH over FM: 6 µs. Performance: 12 µs latency, 77 MB/s bandwidth (FM 1.x, Sun SPARC); 11 µs latency, 77 MB/s bandwidth (FM 2.x, Pentium)
286
User-Level Lightweight Communications (IV)
Hewlett-Packard Active Messages (HPAM): an implementation of Active Messages on an FDDI-connected network of HP 9000/735 workstations; provides protected, direct access to the network by a single process in mutual exclusion; reliable delivery with flow control and retransmission. Performance (FDDI): 14.5 µs latency, 12 MB/s asymptotic bandwidth
287
User-Level Lightweight Communications (V)
U-Net for ATM: user processes are given direct protected access to the network device with no virtualization; the programming interface of U-Net is very similar to that of the NIC itself. Endpoints: the interconnect is virtualized as a set of 'endpoints'; endpoint buffers are used as portions of the NIC's send/receive FIFO queues; endpoint remapping grants direct, memory-mapped, protected access. Performance (155 Mbps ATM): 44.5 µs latency, 15 MB/s asymptotic bandwidth
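The endpoint abstraction can be pictured as a small set of queues remapped into the process's address space. The C sketch below is illustrative only; the field names and sizes are assumptions, not U-Net's actual definitions:

/* Sketch of a U-Net-style endpoint: a per-process virtual view of
 * the NIC as send, receive, and free queues plus a buffer area. */
#include <stdint.h>

#define QLEN 64

typedef struct {
    uint32_t buf_off;  /* offset of the message in the buffer area */
    uint32_t len;      /* message length in bytes */
    uint32_t chan;     /* destination channel/endpoint id */
} unet_desc;

typedef struct {
    unet_desc send_q[QLEN];  /* descriptors queued for transmission */
    unet_desc recv_q[QLEN];  /* descriptors of arrived messages */
    uint32_t  free_q[QLEN];  /* free buffer slots for future receives */
    uint8_t  *buffers;       /* buffer area remapped from the NIC */
} unet_endpoint;

/* Sending = write the payload into a free buffer, fill a descriptor,
 * push it on send_q: no system call on the data path. The NIC
 * multiplexes endpoints and enforces protection via the mapping. */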
288
User-Level Lightweight Communications (VI)
Virtual Interface Architecture (VIA): the first attempt to standardize user-level communication architectures, by Compaq, Intel, and Microsoft; specifies a communication architecture extending the basic U-Net interface with remote DMA (RDMA) services. Characteristics: a SAN with high bandwidth, low latency, low error rate, and scalable, highly available interconnects; error detection in the communication layer; protected multiplexing in the NIC among user processes; reliability is not mandatory. M-VIA: the first VIA implementation on Fast/Gigabit Ethernet; a kernel-emulated VIA for Linux. Performance (100base-T): 23 µs latency, 11.9 MB/s asymptotic bandwidth
289
A Comparison Among Message Passing Systems
Clusters vs. MPPs; standard-interface approaches vs. other approaches; user-level vs. kernel-level implementations
290
“Ping-Pong” Comparison of Message Passing Systems
291
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
292
Network RAM Introduction Remote Memory Paging
Network Memory File Systems Applications of Network RAM in Databases Summary
293
Introduction (I): discusses efficient distribution & exploitation of main memory in a cluster. Network RAM or Network Memory: the aggregate main memory of the workstations in a cluster; local memory vs. remote memory. A significant amount of resources is unused in the cluster at any given time: 80-90% of workstations will be idle in the evening & late afternoon; 1/3 of workstations are completely unused, even at the busiest times of day; 70-85% of the Network RAM in a cluster is idle. Network RAM consists of volatile memory but can still offer good reliability: use some form of redundancy, such as replication of data or parity; offers an excellent cost-effective alternative to magnetic disk storage
294
Introduction (II) Technology trend High performance networks
switched high performance local area networks, e.g. Myrinet; low latency communication and messaging protocols, e.g. U-Net; latency of Active Messages on ATM (20 µs) vs. magnetic disk latency (10 ms). High performance workstations: low-cost memories make Network RAM even more attractive. The I/O-processor performance gap: the performance improvement rate for microprocessors per year far outpaces the rest; disk latency: 10% per year; network latency: 20% per year; network bandwidth: 45% per year
295
Memory Hierarchy Figures for a Typical Workstation Cluster
A typical NOW with 100 workstations; a network with 20 µs latency & 15 MB/s bandwidth
296
Introduction (III) Issues in Using Network RAM
Using network RAM consists essentially of locating & effectively managing the unused memory resources in a workstation cluster: keep up-to-date information about unused memory; exploit memory in a transparent way so that local users will not notice any performance degradation; requires the cooperation of several different subsystems. Common uses of Network RAM: remote memory paging (performance between the local RAM and the disk); network memory filesystems (can be used to store temporary data by providing Network RAM with a filesystem abstraction); network memory DBs (can be used as a large DB cache and/or a fast non-volatile data buffer to store sensitive DB data)
297
Remote Memory Paging (I)
64-bit address spaces & heavy memory usage: applications' working sets have increased dramatically (sophisticated GUIs, multimedia, AI, VLSI design tools, DB and transaction processing systems). Network RAM: the equivalent of remote memory paging; (cache) - (main memory) - (network RAM) - (disk); faster than the disk, so all memory-intensive applications could benefit from Network RAM; useful for mobile or portable computers, where storage space is limited
298
Changing the Memory Hierarchy
299
Remote Memory Paging (II)
Implementation alternatives: the main idea of remote memory paging is to start memory server processes on workstations that are either idle or lightly loaded and have a sufficient amount of unused physical memory. A policy for page management must answer: which local pages to evict? which remote nodes to use? (the client periodically asks the global resource manager, which distributes the pages equally among the servers); what if a server becomes loaded? (negotiation from server to client & migration, or transfer of the data to the server's disk)
300
Remote Paging System Structure
301
Remote Memory Paging (III)
Implementation alternatives: the client workstation has a remote paging subsystem (at user level or in the OS) to enable it to use network memory as backing store; a global resource management process running on a workstation in the cluster holds information about the memory resources & how they are being utilized in the cluster; a configuration server (registry server) process is responsible for the setup of the network memory cluster, is contacted when a machine wants to enter or leave the cluster, and authenticates machines according to the system administrator's rules
302
Remote Memory Paging (IV)
Client remote paging subsystem: 2 main components: a mechanism for intercepting & handling page faults; a page management policy that keeps track of which local pages are stored on which remote nodes. 2 alternative implementations of a transparent remote paging subsystem on the client workstation: using the mechanisms provided by the existing client's OS; or by modifying the virtual memory manager & its mechanisms in the client's OS kernel
303
Transparent Network RAM Implementation using Existing OS Mechanisms (I)
User Level Memory Management: each program uses new memory allocation & management calls, usually contained in a library dynamically or statically linked to the user applications; requires some program code modifications. Prototype by E. Anderson: a custom malloc library using TCP/IP performs a 4K page replacement in 1.5 – 8.0 ms (1.3 – 6.6 times faster). Mechanism: intercept page faults through the use of segmentation faults; a segmentation fault signal from the OS memory manager is intercepted by user-level signal handler routines; the page management algorithm is then called to locate & transfer a remote page into local memory, replacing a local page
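This interception trick can be reproduced with standard POSIX calls: protect a region, catch SIGSEGV, and page data in inside the handler. A minimal sketch follows; fetch_remote_page is a hypothetical stub standing in for the network transfer, and error handling is trimmed for brevity:

/* Minimal sketch of user-level page fault interception via SIGSEGV,
 * in the spirit of the Anderson prototype described above. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void fetch_remote_page(void *page) {
    /* Real system: locate the server holding this page, transfer its
     * contents here, and possibly evict a local page in exchange. */
    memset(page, 0, (size_t)page_size);
}

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr
                          & ~((uintptr_t)page_size - 1));
    mprotect(page, (size_t)page_size, PROT_READ | PROT_WRITE);
    fetch_remote_page(page);       /* faulting access then retries */
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Network RAM" region: inaccessible until first touch. */
    char *region = mmap(NULL, 16 * (size_t)page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;
    region[3 * page_size] = 42;    /* faults; handler pages it in */
    printf("value: %d\n", region[3 * page_size]);
    return 0;
}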
304
User Level Memory Management Solution’s Structure
305
Transparent Network RAM Implementation using Existing OS Mechanisms (II)
Swap Device Driver: the OS-level implementation. All modern operating systems provide a virtual memory manager that swaps pages to the first device on its list of swap devices. Create a custom block device & configure the virtual memory manager to use it as backing store for physical memory: when physical memory runs out, pages are swapped to the first device on the list (the remote memory paging device driver); if and when this Network RAM device runs out of space, the virtual memory manager swaps pages to the next device. Simple in concept & requires only one minor modification to the OS: the addition of a device driver
306
Swap Device Driver Solution’s Structure
307
Transparent Network RAM Implementation using Existing OS Mechanisms (III)
Page Management Policies: which memory server should the client select each time it sends pages? (the global resource management process distributes the pages equally among the servers). Handling of a server's pages when its workstation becomes loaded: inform the client & the other servers of its loaded state; transfer the pages to the server's disk & deallocate them, returning them each time they are referenced; transfer all the pages back to the client; when all servers are full, store the pages on disk
308
Transparent Network RAM Implementation Through OS Kernel Modification (I)
OS modification: this software solution offers the highest performance without modifying user programs, but it is an arduous task and not portable between architectures. Kernel modification provides global memory management in a workstation cluster, using a single unified memory manager as a low-level component of the OS that runs on each workstation; it can integrate all the Network RAM for use by all higher-level functions, including virtual memory paging, memory-mapped files, and filesystem caching
309
Operating System Modification Architecture
310
Global memory management
Transparent Network RAM Implementation Through OS Kernel Modification (II): global memory management. A page fault in node P in GMS is handled according to four cases: the faulted page is in the global memory of another node Q; the faulted page is in the global memory of node Q, but P's memory contains only local pages; the page is on disk; the faulted page is a shared page in the local memory of another node Q
311
Boosting Performance with Subpages
Subpages are transferred over the network much faster than whole pages, which is not the case with magnetic disk transfers. Using subpages allows a larger window for overlapping computation with communication
312
Reliability (I) Main disadvantages of remote memory paging
security: can be resolved through the use of a registry; reliability (fault tolerance). Use a reliable device: the simplest reliability policy; pages are written asynchronously to a reliable device such as the disk; bandwidth limited, but can achieve high availability with simple behavior under failure. Replication (mirroring): replicate pages to remote memory without using the disk; (-) uses a lot of memory space (expensive memory overhead) & high network overhead
313
Reliability (II) Simple parity Parity logging
Simple parity: extensively used in RAIDs, based on the XOR operation; reduces the memory requirement to (1+1/S); incurs network bandwidth & computation overhead; what if two servers fail? Parity logging: reduces the network overhead to close to (1+1/S); difficult problems arise from page overwriting
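The parity arithmetic itself is plain XOR over page contents, as in RAID. A minimal sketch, assuming one parity page per group of S server pages:

/* Simple parity policy sketch: one parity page protects S data pages
 * (one per memory server). A lost page is rebuilt by XOR-ing the
 * parity page with the S-1 surviving pages. */
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Recompute the parity page over S data pages. */
void compute_parity(unsigned char *parity,
                    unsigned char *pages[], int S) {
    memset(parity, 0, PAGE_SIZE);
    for (int s = 0; s < S; s++)
        for (size_t i = 0; i < PAGE_SIZE; i++)
            parity[i] ^= pages[s][i];
}

/* Rebuild the page lost on server `failed` into `out`. With a single
 * parity page, a second simultaneous failure in the same parity
 * group is unrecoverable: the limitation noted on the slide. */
void reconstruct(unsigned char *out, const unsigned char *parity,
                 unsigned char *pages[], int S, int failed) {
    memcpy(out, parity, PAGE_SIZE);
    for (int s = 0; s < S; s++)
        if (s != failed)
            for (size_t i = 0; i < PAGE_SIZE; i++)
                out[i] ^= pages[s][i];
}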
314
An Example of the Simple Parity Policy (Each server in this example can store 1000 pages)
315
A Parity Logging Example with Four Memory Servers and One Parity Server
316
Remote Paging Prototypes (I)
The Global Memory Service (GMS): a modified-kernel solution for using Network RAM as a paging device. For reliability, data is written asynchronously to disk. The prototype was implemented on the DEC OSF/1 OS; under real application workloads it showed speedups of 1.5 to 3.5 and responded effectively to load changes. Generally a very efficient system for remote memory paging
317
Pros and Cons of GMS
318
Remote Paging Prototypes (II)
The Remote Memory Pager: uses the swap device driver approach. The prototype was implemented on the DEC OSF/1 v3.2 OS and consists of a swap device driver on the client machine & a user-level memory server program that runs on the servers. An effective remote paging system with no need to modify the OS
319
Pros and Cons of the Remote Memory Pager
320
Network Memory File System (I)
Use the Network RAM as a filesystem cache, or directly as a faster-than-disk storage device for file I/O. Using network memory as a file cache: all filesystems use a portion of the workstation's memory as a filesystem cache. Two problems: multiple cached copies waste local memory; there is no knowledge of the file's existence elsewhere. Several ways to improve the filesystem's caching performance: eliminate the multiple cached copies; create a global network memory filesystem cache ('cooperative caching')
321
Network Memory File System (II)
Network RamDisks: address the disk's I/O performance problem; use reliable network memory for directly storing temporary files; similar to remote memory paging; can be easily implemented without modifying the OS (as a device driver); can be accessed by any filesystem (e.g. NFS). A Network RamDisk is a block device that unifies all the idle main memories in a NOW under a disk interface: it behaves like any normal disk, but is implemented in main memory (RAM). Difference: instead of memory pages, disk blocks are sent to remote memory; disk blocks are much smaller than memory pages (512 or 1024 bytes) and on many OSs variable in size (from 512 bytes to 8 Kbytes)
322
Applications of Network RAM in Databases (I)
Transaction-Based Systems: substitute disk accesses with Network RAM accesses, reducing the latency of a transaction. A transaction during its execution makes a number of disk accesses to read its data, makes some designated calculations on that data, writes its results to the disk and, at the end, commits (atomicity & recoverability). 2 main areas where Network RAM can boost a transaction-based system's performance: at the initial phase of a transaction, when read requests from the disk are performed, through the use of a global filesystem cache; and speeding up synchronous write operations to reliable storage at transaction commit time (transaction-based systems make many small synchronous writes to stable storage)
323
Applications of Network RAM in Databases (II)
Transaction-Based Systems: steps performed in a transaction-based system that uses Network RAM: at transaction commit time, the data are synchronously written to remote main memory; concurrently, the same data are asynchronously sent to the disk; finally, the data have been safely written to the disk. In the modified transaction-based system, a transaction commits after the second step, with the data replicated in the main memories of 2 workstations. Results: the use of Network RAM can deliver up to 2 orders of magnitude higher performance
324
Summary: the emergence of high speed interconnection networks has added a new layer in the memory hierarchy (Network RAM), boosting the performance of applications: remote memory paging, file systems & databases. Conclusions: using Network RAM results in significant performance improvements; integrating Network RAM into existing systems is easy (device driver, loadable filesystem, user-level code); the benefits of Network RAM will probably increase with time, as the gap between memory and disk continues to widen. Future trends: reliability & filesystem interfaces
325
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
326
Distributed Shared Memory
Introduction Data Consistency Network Performance Issues Other Design Issues Conclusions
327
Introduction (I) Loosely-coupled distributed systems
message passing; RPC. Tightly-coupled architectures (multiprocessors): shared memory; simple programming model. Shared memory gives transparent process-to-process communication and easy programming. Distributed shared memory: extends shared memory for use in more loosely-coupled architectures; processes share data transparently across node boundaries; allows parallel programs to execute without modifications; programs are shorter and easier to write than equivalent message-passing programs; applications are slower than hand-coded message passing; consistency maintenance is a problem
328
Distributed Shared Memory
329
What Can DSM Contribute?
Automatic data distribution: data gets pulled to where it is needed, on demand; location transparency. Transparent coherence management: shared data is automatically kept consistent. Can build fault tolerance and security
330
Introduction (II) DSM vs. Message-Passing
A well-built message-passing implementation can outperform shared memory. The relative costs of communication and computation are important factors. In some cases, DSM can execute faster than message passing
331
Introduction (III) Main Issues of DSM Data Consistency
Data Location; Write Synchronization; Double Faulting; Relaxing Consistency; Application/Type-specific Consistency. Network Performance Issues. Other Design Issues: Synchronization; Granularity; Address-Space Structure; Replacement Policy; Heterogeneity Support; Fault Tolerance; Memory Allocation; Data Persistence
332
Data Consistency
Network latency: caching and consistency algorithms. Three consistency models: strict consistency - 'sequential consistency' (Lamport); loose consistency - synchronization points (compiler or user); no consistency - user-level synchronization and consistency. Many older DSM systems use strict consistency: single writer/multiple reader model; need one data-locating node; data is replicated on two or more nodes, none of which has write access. Two problems: must locate a copy of the data when it is non-resident; must synchronize accesses to the data when writing at the same time. But there are performance benefits to relaxing the consistency model
333
Data Consistency - Data Location
Assign an "owner" node to each item of data; the owner is allowed to write to the data. Ownership algorithms: Fixed ownership - each piece of data is assigned a fixed owner derived from the address of the data (e.g. by hashing). Dynamic ownership - a manager does not own the data, but tracks the current owner: Centralized - a node is selected as the "manager" for all the shared data; Distributed - split the ownership information among the nodes; Broadcast - faulting processors send a broadcast message to find the current owner of each data item; Dynamic search - follow the "probable" owner of each data item; in the worst case, O(p + K log p)
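The "probable owner" scheme can be sketched as a forwarding chain with path compression. The following C sketch assumes messaging primitives (send_page_request, transfer_ownership) that are not defined here; structures are illustrative:

/* Sketch of dynamic distributed ownership with probable-owner hints. */
#define NPAGES 1024

extern void send_page_request(int node, int page, int requester);
extern void transfer_ownership(int page, int requester);

static int prob_owner[NPAGES];   /* this node's per-page owner hint */
static int self;                 /* this node's id */

/* A fault on `page`: send the request toward the probable owner. */
void request_page(int page) {
    send_page_request(prob_owner[page], page, self);
}

/* On receiving a request: serve it if we really own the page,
 * otherwise forward along our own hint. Updating the hint at every
 * hop compresses the chain for later requests. */
void on_page_request(int page, int requester) {
    if (prob_owner[page] == self) {
        transfer_ownership(page, requester);
        prob_owner[page] = requester;
    } else {
        send_page_request(prob_owner[page], page, requester);
        prob_owner[page] = requester;
    }
}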
334
Homeless and Home-based Software DSMs
[Figure: memory organization of homeless vs. home-based software DSM, with pages cached across processors in RO/RW/INV states]
COMA architecture: the local memory of each machine acts like a large cache; shared pages are replicated and migrated dynamically according to the sharing pattern. Example: TreadMarks.
NUMA architecture: each shared page has a home to which all writes are propagated and from which all copies are derived. JIAJIA takes the home-based architecture.
335
Memory Organization of Homeless Software DSM
[Figure: pages cached across processors P1-P4 in INV/RW/RO states]
A page may be cached by any processor, and any processor may be the current owner of the page. No cache replacement mechanism is implemented (difficulty: the last-copy problem), so the shared space is limited by the memory of a single machine.
336
Memory Organization of Home-based Software DSM
[Figure: home-based page mapping across processors P1-P4]
Uniform page mapping means no address translation is needed. Cache replacement is implemented, supporting a large shared space.
337
Data Consistency – Write Synchronization
In maintaining a strictly-consistent data set: Write broadcast (write-update) - broadcasts writes; too expensive to use if there is no low-level broadcast primitive. Invalidation (write-invalidate) - invalidates all other copies before doing an update; broadcasting is expensive in large DSM systems, so use unicast or multicast rather than broadcast. Problem of locating the copies: the manager site maintains a "copy-set"; with a distributed copy-set, data invalidations are passed down the tree
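A manager-based write-invalidate step then amounts to walking the copy-set and invalidating each read-only copy by unicast. A minimal sketch, with the messaging calls assumed:

/* Sketch of manager-side write-invalidate using a copy-set. */
#include <stdbool.h>

#define MAX_NODES 64

typedef struct {
    bool copy_set[MAX_NODES];  /* nodes holding a read-only copy */
    int  owner;
} page_meta;

extern void send_invalidate(int node, int page);  /* assumed */
extern void wait_for_ack(int node);               /* assumed */

/* Handle a write fault on `page` by node `writer`. */
void grant_write(page_meta *pm, int page, int writer) {
    for (int n = 0; n < MAX_NODES; n++) {
        if (pm->copy_set[n] && n != writer) {
            send_invalidate(n, page);  /* unicast, no broadcast */
            wait_for_ack(n);
            pm->copy_set[n] = false;
        }
    }
    pm->owner = writer;                /* writer holds the only copy */
    pm->copy_set[writer] = true;
}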
338
Data Consistency - Double Faulting
Read & write faults mean two network transactions: one to obtain a read-only copy of the page, and a second to obtain write access to the page, so the page is transmitted twice over the network. Algorithms to solve the double faulting problem: Hot Potato - all faults are treated as write faults; the ownership and data are transferred on every fault. Li - transfer ownership on a read fault; a subsequent write fault may then be handled locally. Shrewd - a sequence number per copy of a page, used to track whether a page has been updated; doesn't eliminate the extra network transactions. The crucial factor appears to be the read-to-write ratio of the application
339
Data Consistency – Relaxing Consistency (I)
Permitting temporary inconsistencies = increased performance. Memory is "loosely consistent" if the value returned by a read operation is the value written by an update operation to the same object that "could" have immediately preceded the read operation in some legal schedule of the threads in execution. More than one thread may have write access to the same object, provided that the programmer knows that the writes will not conflict. Delayed updates: virtual synchrony - message operations appear to be synchronous; semantic knowledge - when is the data needed for access or update? These compromise the shared memory concept & increase complexity
340
Data Consistency – Relaxing Consistency (II)
Release Consistency: ordinary & synchronization-related accesses (acquire, release); the "acquire" operation signals that shared data is needed; a processor's updates after an "acquire" are not guaranteed to be performed at other nodes until a "release" is performed. [Figure: release consistency timeline across P1, P2, P3]
341
Data Consistency – Relaxing Consistency (III)
Entry Consistency: used in a DSM system called Midway; weaker than other models; it requires explicit annotations to associate synchronization objects with data; on an "acquire", only the data associated with the synchronization object is guaranteed to be consistent
342
Data Consistency – Relaxing Consistency (IV)
Release Consistency
  P1: x=1; Acq(l1); y=1; Rel(l1)
  P2: z=1
  P3: Acq(l2); a=x; b=y; c=z; Rel(l2)
  Results: P2: a=b=1; P3: a=b=c=1
Scope Consistency (same program)
  Results: P2: a=0, b=1; P3: a=b=0, c=1
343
Data Consistency - Application/Type-specific Consistency (I)
Problem-oriented shared memory: implements fetch and store operations specialized to the particular problem or application; provides a specialized form of consistency and consistency maintenance that exploits application-specific semantics. Types of objects: synchronization - lock objects, managed using a distributed lock system; private - not managed by the runtime, but brought back under the control of DSM if referenced remotely; write-once - initialized once and then only read, so may be efficiently replicated; result - not read until fully updated, so the delayed-update mechanism is used to maximum benefit; producer-consumer - written by one thread, read by a fixed set of threads; perform "eager object movement": updates are transmitted to the readers before they request them
344
Data Consistency - Application/Type-specific Consistency (II)
Problem-oriented shared memory, types of objects (continued): migratory - accessed by a single thread at a time; the movement of the object is integrated with the movement of the lock associated with it; write-many - written by many different nodes at the same time, sometimes without being read in between; handled using delayed updates; read-mostly - updated rarely, using broadcasting; general read-write - multiple threads are reading and writing, or no hint of the object type is given. A 30% performance improvement is obtained. Programmer's control: open implementation (meta-protocol)
345
Network Performance Issues
Communication latency and bandwidth are poor enough to affect the design decisions of a DSM system: much worse than in a multiprocessor; this affects consistency and cache management (e.g. data invalidation). Recent advances in networks: ATM, SONET, switching technology; bandwidths of over 1 Gbps are available even over wide-area networks; processor speeds are increasing; latency is still higher than in tightly-coupled architectures. The key factor is the computation-to-communication ratio: high ratios are suitable for message-passing mechanisms; analogous to the "memory wall"
346
Other Design Issues - Synchronization
It orders and controls accesses to shared data before the data is actually accessed. Clouds merges locking with the cache consistency protocol. IVY - synchronization primitives (eventcounts). Mermaid - P and V operations and events. Munin provides a distributed lock mechanism using "proxy objects" to reduce network load. Anderson - locking using an Ethernet-style back-off algorithm (instead of spin-and-wait locks). Goodman - proposes a set of efficient primitives for the DSM case
347
Other Design Issues - Granularity
When caching objects in local memory, a fixed block size is used. The choice of the block size depends on: the cost of communication; the locality of reference in the application; false sharing, or the "ping-pong" effect; the virtual-memory management unit (or a multiple thereof); the lowest common multiple of hardware-supported page sizes in heterogeneous machines. Memnet - fixed small granularity (chunk = 32 bytes), similar to a shared memory multiprocessor system. Mether - a software implementation of Memnet on top of SunOS; 8K page size
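False sharing is easiest to see in data layout. In the C sketch below, two logically unrelated counters share one DSM block, so writes from two hypothetical nodes A and B bounce the block between them; the padded variant avoids this at a memory cost (the 4096-byte block size is an assumption):

/* False sharing at page granularity. */
struct shared_page {
    long counter_a;   /* updated only by node A */
    long counter_b;   /* updated only by node B: same block! Every
                       * write invalidates the other node's copy,
                       * producing the ping-pong effect. */
};

/* Padding each item to the block size removes the false sharing. */
struct padded_counter {
    long counter;
    char pad[4096 - sizeof(long)];  /* one counter per 4 KB block */
};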
348
Other Design Issues – Address-Space Structure
The arrival of 64-bit processors enables single shared address-space systems (single-level store): the system appears as a set of threads executing in a single shared distributed address space; data items always appear at the same addresses on all nodes; a private and a shared region; security and protection are a major problem. Separate shared address spaces: divide each process's address space into different fixed regions; Clouds - O (shared), P (local/user), K (local/kernel); objects always appear at the same address, but may not be visible from every address space. Shared-region organization: how is the shared region itself organized? Some DSM systems (IVY, Mether) use a single flat region; most DSM systems use paged segmentation: the shared region consists of disjoint pieces that are usually managed separately
349
Other Design Issues - Replacement Policy and Secondary Storage
A policy is needed to deal with cache overflow. LRU does not work well without modification: invalidated pages, read-only pages, replicated pages, and writable pages without replicas all deserve different treatment. (Markatos, Dramitinos) - use the main memory of other nodes to store pages. (Memnet) - reserve part of the main memory of each node as storage for chunks. (IVY) - piggyback memory-state information on other network messages. (Feeley) - use global memory for file buffer caching, VM for applications, and transactions
350
Other Design Issues - Heterogeneity Support
Attempts to provide support for heterogeneous systems: Mermaid, CMU's Mach system, DiSOM. Page-based heterogeneity: support for different page sizes without considering the data on the pages; Mach uses the scheduler page size unit. It's not as simple as it sounds: real heterogeneous-data support must handle byte order (the endian problem) and word size; the CMU DSM divides this problem into hardware data-types and software types. Language-assisted heterogeneity: DiSOM is object-oriented and operates at the language level; the application programmer provides customized packing and unpacking routines for heterogeneity (marshalling); Gokhale and Minnich perform packing at compile time; load/store is 2-6 times slower
351
Other Design Issues – Fault Tolerance
Most DSM systems ignore the fault tolerance issue, yet DSM strongly affects the fault tolerance of a system: when N systems share access to a set of data, the failure of any one of them could lead to the failure of all the connected sites. How do you handle a failed page fault? Clouds: a transactional system called Invocation-Based Concurrency Control (IBCC), using commits. Wu and Fuchs: an incremental checkpointing system based on a twin-page disk storage management system. Ouyang and Heiser: a two-phase commit protocol. DiSOM: checkpoint-based failure recovery. Casper: shadow paging
352
Other Design Issues – Memory Allocation
The same piece of shared memory must not be allocated more than once Amber allocates large chunks of the shared address space to each site at startup, and has an address-space server that hands out more memory on demand
353
Other Design Issues – Data Persistence
DSM may support some form of secondary storage persistence for data held in shared memory. It is not easy to obtain a globally-consistent snapshot of the data on secondary storage, because the data are distributed. There is overlap between DSM and database systems. Garbage collection: global garbage collection. Naming and location issues
354
Conclusions A useful extra service for a distributed system
Provides a simple programming model for a wide range of parallel processing problems. Advantages: no explicit communication is necessary; use may be made of locality of reference. Disadvantages: handling of security and protection issues is unclear; fault tolerance may be more difficult to achieve. Choosing between DSM and message passing is quite difficult. Forin: "the amount of data exchanged between synchronization points is the main indicator to consider when deciding between the use of distributed shared memory and message passing in a parallel application"
355
Reference Book
356
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
357
Scheduling Parallel Jobs on Clusters
Introduction Background Rigid Jobs with Process Migration Malleable Jobs with Dynamic Parallelism Communication-Based Coscheduling Batch Scheduling Summary
358
Introduction (I): clusters are increasingly being used for HPC applications, due to the high cost of MPPs and the wide availability of networked workstations and PCs. The question is how to add the HPC workload to the original general-purpose workload on the cluster without degrading the service of the original workload
359
Introduction (II) The issues in supporting HPC applications
The acquisition of resources: how to distinguish between workstations that are in active use and those that have available spare resources. The requirement to give priority to workstation owners: must not cause noticeable degradation to their work. The requirement to support the different styles of parallel programs: these place different constraints on the scheduling of their processes. Possible use of admission control and scheduling policies to regulate the additional HPC workload. These issues are interdependent
360
Background (I) Cluster Usage Modes NOW (Network Of Workstations)
NOW: based on tapping the idle cycles of existing resources; each machine has an owner (Berkeley NOW, Condor, and MOSIX); when the owner is inactive, the resources become available for general use. PMMPP (Poor Man's MPP): a dedicated cluster acquired for running HPC applications; fewer constraints regarding the interplay between the regular workload and the HPC workload (Beowulf project, RWC PC cluster, and ParPar). This chapter concentrates on scheduling in a NOW environment
361
Background (II) Job Types and Requirements
Job structure and interaction types place various requirements on the scheduling system. Three most common types. Rigid jobs with tight coupling: as in MPP environments, a fixed number of processes communicate and synchronize at a high rate; either a dedicated partition of the machine for each job, or gang scheduling, in which time slicing is used rather than dedicated partitions
362
Background (III) Job Types and Requirements Three most common types
Rigid jobs with balanced processes and loose interactions: do not require that the processes execute simultaneously, but do require that the processes progress at about the same rate. Jobs structured as a workpile of independent tasks: executed by a number of worker processes that take tasks from the workpile and execute them, possibly creating new tasks in the process; a very flexible model that allows the number of workers to change at runtime; leads to malleable jobs that are very suitable for a NOW environment
363
Rigid Jobs with Process Migration
The subsystem responsible for the HPC applications doesn't have full control over the system. Process migration involves the remapping of processes to processors during execution. Reasons for migration: the need to relinquish a workstation and return it to its owner; the desire to achieve a balanced load on all workstations. Metrics: overhead, detach degree. Algorithmic aspects: which process to migrate; where to migrate it. These decisions depend on the data that is available about the workload on the nodes: issues of load measurement and information dissemination
364
Case Study: PVM with Migration (I)
PVM is a software package for writing and executing parallel applications on a LAN: communication/synchronization operations; configuration control; dynamic spawning of processes. To create a virtual parallel machine, a user spawns PVM daemon processes on a set of workstations; the daemons establish communication links among themselves, creating the infrastructure of the parallel machine. PVM distributes its processes in a round-robin manner among the workstations being used, which may create unbalanced loads and may lead to unacceptable degradation of the service
365
Communication in PVM is Mediated by a Daemon (pvmd) on Each Node
366
Case Study: PVM with Migration (II)
Several experimental versions of PVM, such as Migratable PVM and Dynamic PVM, include migration in order to move processes to more suitable locations. PVM has also been coupled with MOSIX and Condor to achieve similar benefits
367
Case Study: PVM with Migration (III)
Migration decisions: made by a global scheduler, based on information regarding load and owner activity. Four steps: the global scheduler notifies the responsible PVM daemon that one of its processes should be migrated; the daemon then notifies all of the other PVM daemons about the pending migration; the process state is transferred to the new location and a new process is created; the new process connects to the local PVM daemon in the new location and notifies all the other PVM daemons. Migration in this system is asynchronous: it can happen at any time, and it affects only the migrating process and other processes that may try to communicate with it
368
Migratable PVM
[Figure: migration of VP1 under the mpvmd. On receiving "migrate VP1 to host2" from the global scheduler, messages sent to VP1 are flushed and further sends to VP1 are blocked; the process state (code, data, heap, stack) is transferred over UDP from the migrating process to a skeleton process on Host2; VP1 is restarted there as VP1' and sends to VP1 are unblocked.]
369
Case Study: MOSIX (I) Multicomputer Operating System for unIX
Support adaptive resource sharing in a scalable cluster computing by dynamic process migration Based on a Unix kernel augmented with a process migration mechanism a scalable facility to distribute load information All processes enjoy about the same level of service Both sequential jobs and components of parallel jobs Maintains a balanced load on all the workstations in cluster
370
MOSIX Infrastructure
[Figure: MOSIX layers PPM (Preemptive Process Migration) and ARSA (Adaptive Resource Sharing Algorithms) on top of a Unix kernel (BSD, Linux).]
371
Case Study: MOSIX (II). The load information is distributed using a randomized algorithm. Each workstation maintains a load vector with data about its own load and the loads of the other machines. At certain intervals (e.g. once every minute), it sends this information to another, randomly selected machine; with high probability, it will also be selected by some other machine and receive such a load vector. If some other machine is found to have a significantly different load, a migration operation is initiated
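The randomized dissemination step can be sketched in a few lines of C; the probe and messaging functions are assumed (they are not MOSIX's actual code), and the fixed-interval timer is left to the caller:

/* Sketch of MOSIX-style randomized load-vector dissemination. */
#include <stdlib.h>
#include <time.h>

#define NNODES 32

typedef struct {
    double load[NNODES];
    time_t stamp[NNODES];   /* freshness of each entry */
} load_vec;

static load_vec lv;
static int self;

extern double local_load(void);                        /* assumed */
extern void send_vector(int node, const load_vec *v);  /* assumed */

/* Called once per interval: refresh our entry, gossip to one peer. */
void dissemination_tick(void) {
    lv.load[self] = local_load();
    lv.stamp[self] = time(NULL);
    int peer;
    do { peer = rand() % NNODES; } while (peer == self);
    send_vector(peer, &lv);
}

/* On receipt, keep the fresher entry per node. */
void on_vector(const load_vec *in) {
    for (int n = 0; n < NNODES; n++)
        if (in->stamp[n] > lv.stamp[n]) {
            lv.load[n] = in->load[n];
            lv.stamp[n] = in->stamp[n];
        }
}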
372
Case Study: MOSIX (III)
Migration is based on the home-node concept: each process has a home node, its own workstation. When it is migrated, it is split into two parts: body and deputy. The body contains all the user-level context and the site-independent kernel context, and is migrated to another node. The deputy contains the site-dependent kernel context and is left on the home node. A communication link is established between the two parts, so that the process can access its local environment via the deputy and so that other processes can access it. Running PVM over MOSIX leads to improvements in performance
373
Process Migration in MOSIX
[Figure] Process migration in MOSIX divides the process into a migratable body, moved from the overloaded workstation to an underloaded one, and a site-dependent deputy, which remains in the home node; site-dependent system calls are forwarded to the deputy, and return values and signals flow back.
374
Malleable Jobs with Dynamic Parallelism
Parallel jobs should adjust to resources being reclaimed by their owners at unpredictable times. This emphasizes the dynamics of workstation clusters: parallel jobs should adjust to such varying resources
375
Identifying Idle Workstations
Use only idle workstations. Using idle workstations to run parallel jobs requires: the ability to identify idle workstations (by monitoring keyboard and mouse activity); the ability to retreat from a workstation: when a workstation has to be evicted, the worker is killed and its tasks are reassigned to other workers
376
Case Study: Condor and WoDi (1)
Condor is a system for running batch jobs in the background on a LAN, using idle workstations. When a workstation is reclaimed by its owner, the batch process is suspended and later restarted from a checkpoint on another node. Condor is the basis for the LoadLeveler product used on IBM workstations
377
Case Study: Condor and WoDi (2)
Condor was augmented with CARMI (Condor Application Resource Management Interface), which allows jobs to request additional resources and to be notified if resources are taken away
378
Case Study: Condor and WoDi (3)
WoDi (Work Distributor): supports simple programming of master-worker applications; the master process sends work requests (tasks) to the WoDi server
379
Master-Worker Applications using WoDi Server to Coordinate Task Execution
[Figure: interaction among the WoDi server, the CARMI server, the Condor scheduler, and worker processes: resource requests, allocation results, worker spawning, and task distribution]
380
Case Study: Piranha and Linda (1)
Linda: a parallel programming language; a coordination language that can be added to Fortran or C; based on an associative tuple space that acts as a distributed data repository. Parallel computations are created by injecting unevaluated tuples into the tuple space; the tuple space can also be viewed as a workpile of independent tuples that need to be evaluated
381
Case Study: Piranha and Linda (2)
Piranha: a system for executing Linda applications on a NOW. Programs that run under Piranha must include three special user-defined functions: feeder, piranha, and retreat. The piranha function is executed automatically on idle workstations and transforms work tuples into result tuples (r). The feeder function generates work tuples (w). The retreat function is invoked when a workstation is reclaimed by its owner
382
Piranha programs include a feeder function that generates work tuples (w), and piranha functions that are executed automatically on idle workstations and transform the work tuples into result tuples (r). If a workstation is reclaimed, the retreat function is called to return the unfinished work to the tuple space. [Figure: feeder, piranha, and retreat functions operating on the tuple space across busy, newly idle, and reclaimed workstations]
383
Communication-Based Coscheduling
If the processes of a parallel application communicate and synchronize frequently, they should execute simultaneously on different processors: this saves the overhead of frequent context switches and reduces the need for buffering during communication. Combined with time slicing, this is what gang scheduling provides. Gang scheduling implies that the participating processes are known in advance; the alternative is to identify them during execution, in which case only a subset of the processes are scheduled together, leading to coscheduling rather than gang scheduling
384
Demand-Based Coscheduling (1)
The decision about which processes should be scheduled together is based on actual observations of the communication patterns. This requires the cooperation of the communication subsystem: it monitors the destinations of incoming messages and raises the priority of the destination process, so the sender process tends to be coscheduled with the destination process. Problem: raising the priority of any process that receives a message is unfair when multiple parallel jobs co-exist; this is addressed with epoch numbers
385
Epoch numbers allow a parallel job to take over the whole cluster, in the face of another active job
386
Demand-Based Coscheduling (2)
The epoch number on each node is incremented when a spontaneous context switch is made, i.e. one that is not the result of an incoming message. This epoch number is appended to all outgoing messages. When a node receives a message, it compares its local epoch number with the one in the incoming message, and switches to the destination process only if the incoming epoch number is greater
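In code, the epoch rule is a small comparison on the receive path. A minimal sketch, with the scheduler hook assumed:

/* Sketch of epoch-number demand-based coscheduling. */
static unsigned local_epoch;

/* A spontaneous context switch (not triggered by a message). */
void on_spontaneous_switch(void) {
    local_epoch++;
}

/* The current epoch is appended to every outgoing message. */
unsigned tag_outgoing(void) {
    return local_epoch;
}

extern void boost_priority(int pid);   /* assumed scheduler hook */

/* Receive path: switch to the destination process only when the
 * message carries a newer epoch than our own. */
void on_message(int dest_pid, unsigned msg_epoch) {
    if (msg_epoch > local_epoch) {
        local_epoch = msg_epoch;       /* adopt the newer epoch */
        boost_priority(dest_pid);      /* coschedule with the sender */
    }
}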
387
Implicit Coscheduling (1)
Explicit control may be unnecessary when using Unix facilities (sockets): Unix processes that perform I/O (including communication) get a higher priority, so processes participating in a communication phase will get high priority on their nodes without any explicit measures being taken. This suits jobs with long phases of computation and phases of intensive communication
388
Implicit Coscheduling (2)
Make sure that a communicating process is not de-scheduled while it is waiting for a reply from another process, by using two-phase blocking (spin blocking): a waiting process initially busy-waits (spins) for some time, waiting for the anticipated response; if the response does not arrive within the pre-specified time, the process blocks and relinquishes its processor in favor of another ready process. Implicit coscheduling keeps processes in step only when they are communicating; during computation phases they do not need to be coscheduled
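Two-phase blocking can be written directly against POSIX polling primitives. A minimal sketch: spin with non-blocking probes for a bounded time, then fall back to a blocking wait:

/* Spin-then-block wait for data on a socket/file descriptor. */
#include <poll.h>
#include <time.h>

static long elapsed_us(const struct timespec *t0) {
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0->tv_sec) * 1000000L +
           (t1.tv_nsec - t0->tv_nsec) / 1000L;
}

/* Wait for input on `fd`, spinning for at most `spin_us` microseconds
 * before relinquishing the CPU with a blocking poll. */
int spin_block_wait(int fd, long spin_us) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct timespec t0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (elapsed_us(&t0) < spin_us)
        if (poll(&pfd, 1, 0) > 0)   /* non-blocking probe */
            return 1;               /* reply arrived while spinning */
    return poll(&pfd, 1, -1);       /* block: yield the processor */
}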
389
Batch Scheduling
There is a qualitative difference between the work done by workstation owners and the parallel jobs that try to use spare cycles: workstation owners do interactive work and require immediate response; parallel jobs are compute-intensive and run for long periods, so it makes sense to queue them until suitable resources become available
390
Admission Controls (1)
An HPC application places a heavy load on the system. Consideration for interactive users implies that these HPC applications be curbed if they hog the system; one option is to refuse to admit them into the system in the first place
391
Admission Controls (2): a general solution is a batch scheduling system (e.g. DQS, PBS). Define a set of queues to which batch jobs are submitted; each contains jobs characterized by attributes such as expected run time and memory requirements. The batch scheduler chooses jobs for execution based on these attributes and the available resources; other jobs are queued so as not to overload the system. A design choice: use only idle workstations vs. use all workstations, with preference for those that are lightly loaded
392
Case Study: Utopia/LSF (1)
Utopia is an environment for load sharing on large-scale heterogeneous clusters: a mechanism for collecting load information; a mechanism for transparent remote execution; a library to use them from applications. Collection of load information is done by a set of daemons, one on each node: the LIM (Load Information Manager). The LIM on the host with the lowest host ID collects and distributes load vectors (recent CPU queue length, memory usage, number of users) to all slave nodes; the slaves can use this information to make placement decisions for new processes. Using a centralized master does not scale to large systems
393
Utopia uses a two-level design to spread load information in large systems
[Figure: LIMs on the nodes report load information to strong servers in a two-level hierarchy]
394
Case Study: Utopia/LSF (2)
Support for load sharing across clusters is provided by communication among the master LIMs of the different clusters; it is possible to create virtual clusters that group together powerful servers that are physically dispersed across the system. Utopia's batch scheduling system: queueing and allocation decisions are made by a master batch daemon, which is co-located with the master LIM; the actual execution of and control over the batch processes is done by slave batch daemons on the various nodes
395
Summary (1)
It is most important: to balance the loads on the different machines, so that all processes get equal service; not to interfere with workstation owners when using idle workstations; to provide parallel programs with a suitable environment (simultaneous execution of interacting processes); not to flood the system with low-priority compute-intensive jobs (admission controls and batch scheduling are necessary)
396
Summary (2)
There is room for improvement by considering how to combine multiple assumptions and merge the approaches used in different systems. The following combination is possible: have a tunable parameter that selects whether workstations are shared in general or used only when idle; provide migration to enable jobs to evacuate workstations that become overloaded, are reclaimed by their owner, or become too slow relative to the other nodes of the parallel job; provide communication-based coscheduling for jobs that seem to need it (better not to require the user to specify it); provide batch queueing with a checkpoint facility so that heavy jobs run only when they do not degrade performance for others
397
High Performance Cluster Computing Architectures and Systems
Hai Jin Internet and Cluster Computing Center
398
Load Sharing and Fault Tolerance Manager
Introduction Load Sharing in Cluster Computing Fault Tolerance by Means of Checkpointing Integration of Load Sharing and Fault Tolerance Related Works Conclusion
399
Introduction (1): increasing need for processing power for scientific calculations; large growth of distributed information systems. These trends form the background against which distributing computation over workstation networks emerged, and they make it very attractive
400
Introduction (2)
Idle workstations: 33% ~ 78% of the time is idle. In a study of 18 workstations over a six-month period, workstations were free 69% of the time and processors were free 93% of the time
401
Introduction (3)
To achieve high performance, an efficient mechanism for program allocation must be provided, so that idle workstations can be exploited effectively. Two load sharing policies: dynamic placement and migration
402
Introduction (4)
Dynamic placement: allocate programs according to the current system state. Migration: move processes according to system and application evolution
403
Introduction (5)
Load sharing is dedicated to long-lived applications. In the face of hardware failures, reboots, and network disconnections, fault tolerance becomes an essential component
404
Introduction (6)
Main scope of this chapter: the interaction between load balancing and fault tolerance in a cluster; the combination of the two facilities can improve the performance of reliable parallel applications
405
Load Sharing in Cluster Computing (1)
Optimum load sharing strategies are very hard to implement and are well known to be NP-complete. It is impossible to get exact knowledge of the global state of the system at a given instant. The execution of complex allocation algorithms may represent a heavy burden. The instability of allocation algorithms is due to inaccurate information: the third point is a consequence of the first
406
Load Sharing in Cluster Computing (2)
Parallel program placement. Static allocation: adapted for multiprocessor systems (PVM); its drawback is that no dynamic load balancing is performed. Dynamic load sharing: dynamic placement - processes are allocated at start-up and stay at the same location; process migration - processes can move according to overload conditions or the reactivation of workstations by their owners
407
Checkpointing a Single Process
A snapshot of the process's address space at a given time. To reduce the cost of checkpointing: Incremental method - reduce the amount of data that must be written by saving only what changed since the last checkpoint. Non-blocking method - allow the process to continue executing while its checkpoint is written to stable storage, using internal copy-on-write memory protection: pages are protected during the checkpoint; if the process tries to modify a protected page, the page is copied to a new page and the protection on the original is removed, and the checkpoint manager finishes writing from the copy
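On Unix systems, fork() provides exactly this copy-on-write snapshot for free, which is a common way to prototype non-blocking checkpointing. A minimal sketch; the checkpointed region and file format are simplified, and the child should eventually be reaped with waitpid() or SIGCHLD handling:

/* Non-blocking checkpoint via fork()'s copy-on-write semantics. */
#include <stdio.h>
#include <unistd.h>

void checkpoint(const void *state, size_t len, const char *path) {
    pid_t pid = fork();
    if (pid == 0) {                /* child: frozen COW snapshot */
        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(state, 1, len, f);  /* write snapshot to storage */
            fclose(f);
        }
        _exit(0);
    }
    /* Parent returns immediately and may keep modifying `state`;
     * the kernel's copy-on-write keeps the child's view intact. */
}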
408
Checkpointing of Communicating Processes
To recover from a fault, the execution must be rolled back to a consistent global state. Rolling back one process may therefore result in rollbacks of other processes, so that the whole system remains consistent
409
Domino Effect
410
Checkpointing Techniques (1)
Checkpointing techniques can be classified into two categories: coordinated and independent. Coordinated checkpointing: processes coordinate their checkpointing actions such that the collection of checkpoints represents a consistent state of the whole system. Drawbacks: the messages used for synchronizing a checkpoint are an important source of overhead; surviving processes may have to roll back to their latest checkpoint in order to remain consistent with recovering processes. Analyzing the interactions between processes can reduce the number of processes to roll back
411
Checkpointing Techniques (2)
Independent checkpointing: each process independently saves its state without any synchronization with the others; message logging is used to avoid the domino effect. At each new checkpoint, all messages are deleted from the associated backup. The main drawback is the input/output overhead
412
Checkpointing Techniques (3)
Independent checkpointing: two classes of message logging. Pessimistic message logging: write incoming messages to stable storage before delivering them to the application; to restart, the failing process is rolled back to the last checkpoint, and replies to its outgoing messages are returned immediately from the log. Optimistic message logging: messages are tagged with information that allows the system to keep track of inter-process dependencies; this reduces logging overhead, but processes that survive a failure may still be rolled back; alternatively, messages are gathered in the main memory of the sending host and asynchronously saved to stable storage
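The pessimistic variant reduces to one rule on the receive path: force the message to stable storage before delivery. A minimal sketch, with the delivery upcall assumed:

/* Pessimistic message logging on the receive path. */
#include <stdio.h>
#include <unistd.h>

extern void deliver(const char *msg, size_t len);  /* assumed upcall */

void on_incoming(FILE *logf, const char *msg, size_t len) {
    fwrite(&len, sizeof len, 1, logf);  /* length-prefixed record */
    fwrite(msg, 1, len, logf);
    fflush(logf);
    fsync(fileno(logf));                /* the "pessimistic" step:
                                         * force to stable storage */
    deliver(msg, len);                  /* only now hand to the app */
}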
413
Integration of Load Sharing and Fault Tolerance
GATOSTAR: a fault tolerant load sharing facility, the combination of GATOS and STAR. GATOS is a load sharing manager which automatically distributes parallel applications among hosts according to multicriteria algorithms. STAR is a software fault manager which automatically recovers processes affected by host failures
414
Environment and Architecture (1)
System environment: BSD Unix OS (SunOS); works on a set of heterogeneous workstations connected by a LAN; migration domains of compatible hosts, within which processes can freely move. The average crash rate observed in a LAN of 10 workstations is about one every 2.7 days. Fail-silent processors: faulty nodes simply stop, and the remote nodes are not notified
415
Environment and Architecture (2)
Application model: an application is a set of communicating processes connected by a precedence graph. The graph contains all the qualitative information needed to execute the application and helpful to the load sharing manager: precedence and communication between processes; files in use. For applications running for extended periods of time (large-number factoring, VLSI applications, image processing), the failure probability becomes significant, so load sharing and the need for reliability are both important concerns
416
Environment and Architecture (3)
GATOSTAR architecture: a ring of hosts which exchange information about host functioning and process execution; automatically recovers the system and quickly returns it to operation, thus increasing the availability of the distributed system's resources. Each host maintains a load vector: a view of all host loads. Periodically, each host sends its vector to its immediate successor; load messages are also used to detect host failures
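The ring traffic thus doubles as both load dissemination and failure detection. A minimal C sketch; the host count, miss limit, and messaging/timer plumbing are assumptions:

/* Sketch of a GATOSTAR-style ring: gossip to the successor, detect a
 * silent predecessor. */
#define NHOSTS 16
#define MISS_LIMIT 3

extern double local_load(void);                          /* assumed */
extern void send_to(int host, const double *v, int n);   /* assumed */
extern void declare_failed(int host);                    /* assumed */

static double load_vector[NHOSTS];
static int self, missed;

/* Called once per interval. */
void on_timer_tick(void) {
    load_vector[self] = local_load();
    send_to((self + 1) % NHOSTS, load_vector, NHOSTS);
    /* We also expect a vector from our predecessor each interval;
     * too many misses are treated as a failure. */
    if (++missed > MISS_LIMIT)
        declare_failed((self - 1 + NHOSTS) % NHOSTS);
}

void on_vector_from_predecessor(const double *vec) {
    missed = 0;                        /* predecessor is alive */
    for (int h = 0; h < NHOSTS; h++)
        if (h != self)
            load_vector[h] = vec[h];   /* adopt the circulated view */
}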
417
Environment and Architecture (4)
GATOSTAR architecture: four daemons running on each host. LSM - a load sharing manager in charge of allocation and migration strategies. FTM - a fault tolerance manager responsible for checkpointing and message redirection. RM - a ring manager in charge of load transmission and failure detection. RFM - a replicated file manager that implements the reliable storage
418
Gatostar Architecture
419
Process Allocation (1) Process Placement
Process placement: to evaluate the needs of each process, keep track of previous executions and/or allow application designers to give appropriate indications. Dynamic process allocation can operate according to the following criteria: host load, process execution time, required memory, and communication between processes. In the application description, the programmer can specify appropriate allocation criteria for each program according to its execution needs, specify the set of hosts involved in the load distribution, and provide different binary code versions for each program in order to deal with system heterogeneity
420
Process Allocation (2) Selecting the least loaded host
Allocating processes according to load allows us to benefit from idle workstation power. The load of a host is directly proportional to its CPU utilization and inversely proportional to its CPU speed. On each host, the load sharing manager (LSM) locally computes an average load and uses failure detection messages to exchange load information. The program is allocated to the most lightly loaded host if the load of that host is below the overload threshold; otherwise, the program is started on the originating host, as all hosts in the system are heavily loaded or used by interactive users
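Normalizing utilization by CPU speed and applying the overload threshold can be sketched directly; the structure below is illustrative, not GATOSTAR's actual code:

/* Sketch of least-loaded host selection with an overload threshold. */
typedef struct {
    double cpu_util;    /* measured CPU utilization */
    double cpu_speed;   /* relative CPU speed factor */
} host_info;

/* Load directly proportional to utilization, inversely to speed. */
static double load_of(const host_info *h) {
    return h->cpu_util / h->cpu_speed;
}

int select_host(const host_info *hosts, int n, int local,
                double overload_threshold) {
    int best = local;
    double best_load = load_of(&hosts[local]);
    for (int i = 0; i < n; i++)
        if (load_of(&hosts[i]) < best_load) {
            best = i;
            best_load = load_of(&hosts[i]);
        }
    /* If even the best host is overloaded, run on the originator. */
    return best_load < overload_threshold ? best : local;
}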
421
Process Allocation (3) Selecting the least loaded host: main difficulty
Finding a good value for the overload threshold, and getting global state information: no host can have exact knowledge of the loads of the other hosts, because network transmission time introduces a delay between the instant when a host receives the information and the instant it uses it to decide where to allocate a process. Large load fluctuations can occur unexpectedly: if the delays mentioned above are significant, a program can be allocated to a host which has since become heavily loaded. The remedy is to re-direct the program allocation request if the load of the selected host turns out to be higher than the overload threshold
422
Process Allocation (4) Multicriteria load sharing algorithm
Response time: aims at minimizing the total execution time of the whole application; the selected hosts are allocated to programs in order of decreasing execution time values, so the longest programs are allocated to the least loaded and fastest hosts
423
Process Allocation (5) Multicriteria load sharing algorithm
File accesses: find the host where a program should be allocated so that the total cost of file accesses is minimized. The allocation decision is based on a number of parameters: number of accesses, size of file accesses, disk and network speeds. These parameters and the file sizes are then used to decide whether to migrate one or more of the remaining files or to access them remotely
424
Process Allocation (6) Multicriteria load sharing algorithm
Communication between programs: aims to minimize the total cost of communications between programs by reducing the number of remote communications; requires keeping information on the location of all the application's programs already allocated in the system
425
Process Allocation (7) Multicriteria load sharing algorithm
Memory needs
the host on which a program will be allocated must have enough memory to run it, determined from the memory needs of the program to be allocated and the available memory on every host
the programs allocated to one host must not exceed the maximum physical memory of this host
The memory allocation criterion in GATOSTAR
programs are managed by a batch queue system, waiting for the appropriate memory space to become available
this ensures the correct execution of the application but may increase its response time
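A minimal sketch of the queueing behavior described above; the queue layout and host model are assumptions, not GATOSTAR's actual structures.

```python
# Sketch of the memory criterion: dispatch only to hosts with enough free
# physical memory; programs that fit nowhere stay queued for a later round.
from collections import deque

def dispatch(queue, hosts):
    """queue: deque of (program, mem_needed); hosts: dict host -> free memory."""
    placed, still_waiting = [], deque()
    while queue:
        program, need = queue.popleft()
        fit = next((h for h, free in hosts.items() if free >= need), None)
        if fit is None:
            still_waiting.append((program, need))  # wait for memory to free up
        else:
            hosts[fit] -= need
            placed.append((program, fit))
    queue.extend(still_waiting)
    return placed

q = deque([("p1", 512), ("p2", 2048)])
print(dispatch(q, {"h1": 1024, "h2": 1024}), list(q))  # p2 must wait
```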
426
Process Allocation (8) Multicriteria load sharing algorithm
Performance criteria combination
ensures program placement according to several policies at the same time
start with the set of available hosts; this set is reduced according to the first criterion in the given list of criteria
the mechanism is repeated with the remaining criteria until only one host remains or all the refinements have been done (see the sketch below)
the multicriteria approach is very effective for a general-purpose load sharing system: it can adapt the allocation to a wide spectrum of application and program behaviors
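A minimal sketch of this refinement loop, assuming each criterion is a filter over the candidate set; the attribute dict, the guard against an empty refinement, and the tie-break are my assumptions.

```python
# Sketch of criteria combination: each criterion in turn shrinks the
# candidate set, stopping when one host remains or all refinements are done.
def select_host(hosts, criteria):
    """hosts: dict name -> attributes; criteria: ordered list of filters,
    each returning the subset of host names it accepts."""
    candidates = set(hosts)
    for criterion in criteria:
        refined = criterion(candidates, hosts)
        if refined:              # assumed guard: a criterion may not empty the set
            candidates = refined
        if len(candidates) == 1:
            break
    return sorted(candidates)[0]  # deterministic tie-break among survivors

not_overloaded = lambda c, h: {n for n in c if h[n]["load"] < 0.8}
enough_memory = lambda c, h: {n for n in c if h[n]["mem"] >= 512}

hosts = {"h1": {"load": 0.9, "mem": 2048},
         "h2": {"load": 0.3, "mem": 1024},
         "h3": {"load": 0.2, "mem": 256}}
print(select_host(hosts, [not_overloaded, enough_memory]))  # h2
```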
427
Process Allocation (9) Process Migration
Three main questions that must be answered
when to migrate a process: decided from a migration threshold
which process to choose: the local server chooses a process that has been executed locally for at least a given local execution time
where to move the chosen process: decided from a reception threshold
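The three decisions fit together as in the following sketch; the threshold values, the minimum-local-time figure, and the "longest-running movable process" choice are illustrative assumptions.

```python
# Hedged sketch of the three migration decisions: when (load above the
# migration threshold), which (ran locally long enough), where (a host
# under the reception threshold). All values are assumptions.
MIGRATION_THRESHOLD = 0.8
RECEPTION_THRESHOLD = 0.3
MIN_LOCAL_TIME = 5.0  # seconds of local execution before a process may move

def pick_migration(local_load, processes, remote_loads):
    """processes: list of (pid, local_exec_time); remote_loads: host -> load."""
    if local_load <= MIGRATION_THRESHOLD:
        return None                                   # when: not overloaded
    movable = [p for p in processes if p[1] >= MIN_LOCAL_TIME]
    if not movable:
        return None                                   # which: nothing eligible
    target = min(remote_loads, key=remote_loads.get)
    if remote_loads[target] >= RECEPTION_THRESHOLD:
        return None                                   # where: no willing receiver
    pid = max(movable, key=lambda p: p[1])[0]
    return pid, target

print(pick_migration(0.95, [(1, 2.0), (2, 9.0)], {"h2": 0.1, "h3": 0.5}))
```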
428
Process Allocation (10) Process Migration
Load sharing server periodically computes two conditions
1. a faster workstation’s load is under the reception threshold (shows the potential to exploit a faster host)
2. the local host’s load is above the migration threshold and at least one workstation’s load is under the reception threshold (shows that there is a way to reduce the local load)
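The two conditions translate directly into a periodic check, sketched here with assumed threshold values and data structures.

```python
# Sketch of the two periodic conditions (values and structures assumed).
def should_consider_migration(local_load, local_speed, loads, speeds,
                              reception=0.3, migration=0.8):
    """loads/speeds: dicts mapping remote host -> load / relative speed."""
    # Condition 1: a faster workstation is under the reception threshold.
    faster_underloaded = any(speeds[h] > local_speed and loads[h] < reception
                             for h in loads)
    # Condition 2: we are overloaded and someone can receive work.
    can_offload = (local_load > migration and
                   any(l < reception for l in loads.values()))
    return faster_underloaded or can_offload

print(should_consider_migration(0.9, 1.0, {"h2": 0.2}, {"h2": 2.0}))  # True
```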
429
Failure Management (1) Failure detection mechanism Recovery method
a structuring of hosts in a logical ring: each host independently checks its successor
Recovery method
checkpoint/rollback management: user processes save their states to guard against the failure of their local host
independent checkpointing with pessimistic message logging, well suited to applications composed of processes exchanging small streams of data
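A minimal sketch of successor checking on a logical ring; ping() here is a stand-in for whatever probe the real detector uses.

```python
# Sketch of ring-based failure detection: each host probes only its successor.
def successor(host, ring):
    """ring: ordered list of hosts forming the logical ring."""
    return ring[(ring.index(host) + 1) % len(ring)]

def check_successor(host, ring, ping):
    """ping: callable returning True if the probed host answered in time."""
    nxt = successor(host, ring)
    return (nxt, ping(nxt))  # (who was checked, is it alive?)

ring = ["h1", "h2", "h3"]
alive = {"h1": True, "h2": False, "h3": True}
for h in ring:
    print(h, "->", check_successor(h, ring, lambda n: alive[n]))
# h1 detects that h2 failed; h3 wraps around and checks h1
```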
430
Failure Management (2) Recovery method Advantages
the domino effect is eliminated
the checkpoint operation can be performed unilaterally by each process, and any checkpointing policy may be used
only one checkpoint is associated with each process, so the checkpoint cost is lower than in consistent checkpointing
recovery is implemented efficiently because no interprocess communication takes place during a replay
GATOSTAR implements incremental and non-blocking checkpointing
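A toy illustration of why pessimistic logging allows unilateral recovery: every received message is logged before delivery, so a process can replay from its own checkpoint and log without re-contacting peers. The class and its fields are illustrative only.

```python
# Toy illustration of pessimistic message logging and independent recovery.
class Process:
    def __init__(self):
        self.state, self.log, self.checkpoint = 0, [], (0, 0)

    def receive(self, msg):
        self.log.append(msg)   # log first (pessimistic), then deliver
        self.state += msg      # stand-in for real message processing

    def take_checkpoint(self):
        self.checkpoint = (self.state, len(self.log))  # state + log position

    def recover(self):
        self.state, replay_from = self.checkpoint
        for msg in self.log[replay_from:]:  # replay without any communication
            self.state += msg

p = Process()
p.receive(5); p.take_checkpoint(); p.receive(7)
saved = p.state           # 12
p.state = None            # simulate a crash wiping volatile state
p.recover()
print(p.state == saved)   # True: recovered unilaterally
```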
431
Performance Study (1) Cost of initial placement
Host selection is done locally when starting the application; the only message exchanged is the request from the local load sharing manager to the remote one to execute a given program
Cost of process migration
the time for a process migration is proportional to the amount of data to be transferred from the source host to the selected one
the size of each test process is less than 100 kilobytes, and the migration time of these processes is less than 3 seconds
432
Migration Time According to Process Size
433
Performance Study (2) Process allocation
The execution time of six parallel programs running on six hosts
compares the non-preemptive placement strategy with migration
each program computes a given number of iterations, from 50 million to 500 million
there is no communication between the programs
434
Execution Time of Six Programs According to the Number of Iterations
435
Speedup According to the Number of Hosts
436
Speedup According to Application Complexity
437
Performance Study (3) Extrapolation by simulation
Fixed threshold vs. adaptive threshold
an allocation manager should adapt the threshold values according to the global load and the behavior of the current application (a minimal sketch follows)
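One simple way to make the threshold track the global load is linear interpolation, sketched below; the chapter only argues that thresholds should adapt, so the interpolation itself is an assumption.

```python
# Sketch of an adaptive overload threshold: raise it when most hosts are busy
# (migration is less worthwhile), lower it when the system is mostly idle.
def adaptive_threshold(global_loads, low=0.5, high=0.9):
    mean_load = sum(global_loads) / len(global_loads)
    return low + (high - low) * mean_load  # busier system -> higher threshold

print(adaptive_threshold([0.1, 0.2, 0.3]))  # lightly loaded -> ~0.58
print(adaptive_threshold([0.8, 0.9, 0.7]))  # heavily loaded -> ~0.82
```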
438
Simulation of Response Time According to the Threshold Policy
439
Simulation of Response Time According to the Overload Threshold Value
440
Performance Study (4) Performance of checkpointing
Three long-running, compute-intensive applications exhibiting different memory usage and communication patterns
441
Parallel Applications Evaluation
Independent checkpointing and pessimistic message logging
2-minute checkpointing interval
checkpoints and logs are duplicated
442
Related Works REM, Paralex, Condor, DAWGS (Distributed Automated Workload Sharing System)
developed essentially as load sharing managers, with fault tolerance only a limited extension
REM and Paralex are load sharing managers that support cooperative applications
REM provides a mechanism by which processes may create, communicate with, and terminate child processes on remote workstations
active process replication is used to ensure the fault-tolerant execution of the child processes
a failure of the local user’s machine implies a failure of the whole computation
443
Conclusion (1) Described the advantages of unifying load sharing and fault tolerance
improves applications’ response time
decreases the overhead of software fault management
GATOSTAR is a unified load sharing and fault tolerance facility
it takes advantage of the overall system resources and allows the distribution of application processes onto the most appropriate hosts
the process allocation policy takes into account both information about the system state and the resource needs of the applications
a checkpoint mechanism allows the recovery of processes after host failures and is also used to move processes in case of load fluctuations
444
Conclusion (2) Showed the benefit of load sharing and of the multicriteria allocation algorithms in a LAN of workstations
the combination of dynamic placement and process migration outperforms dynamic placement alone
dynamic placement allows use of the machines available at launch time at very low cost (one request)
for long-lived applications, process migration corrects this initial placement in response to failures, or to changes in program behavior, host load, or interactive users’ activity
445
Beowulf Why Beowulf? Provenance Setting Poetic devices Terms Themes
446
Why Study Beowulf?
1. Beowulf is the oldest poem in the English language, so everything written since Beowulf stems from it in some way
2. The story of Beowulf encompasses common themes that we still see in English literature today
3. Beowulf is simply good writing
447
Why Study Beowulf?
4. In some ways, it doesn’t matter what you read, but how you read it, so…since Beowulf came first, you might as well start there.
5. Studying Old English improves your understanding of modern English
6. It’s a great story
448
Beowulf’s Provenance What we don’t know:
who wrote it
when exactly it was written
how much, exactly, is based on historical truth
449
Beowulf’s Provenance What we do know:
Beowulf is the oldest surviving English poem.
It’s written in Old English (or Anglo-Saxon), which is the basis for the language we speak today.
Some of the characters in the poem actually existed.
The only copy of the manuscript was written sometime around the 11th century A.D. (1000’s), however…
450
The actual poem probably dates from the 8th century (700’s) or so, and…
The story may be set even earlier, around 500 A.D.
There are a lot of Christian references in the poem, but the characters and setting are Pagan…this means a monk probably translated it.
451
Beowulf’s Provenance So why wasn’t it written down in the first place? This story was probably passed down orally for centuries before it was first written down. It wasn’t until after the Norman Invasion (1066) that writing stories down became common in this part of the world.
452
Beowulf’s Provenance So what’s happened to the manuscript since the 11th century? Eventually, it ended up in the library of this guy, Robert Cotton.
453
Beowulf’s Provenance Unfortunately, Cotton’s library burned in 1731. Many manuscripts were entirely destroyed. Beowulf was partially damaged. The manuscript is now preserved and carefully cared for in the British Museum.
454
Setting: Beowulf’s time and place
Although Beowulf was written in English, it is set in what is now Sweden, where a tribe called the Geats lived. The story may take place as early as 400 or 500 A.D.
455
Setting: Beowulf’s time and place
[maps: “Time of Beowulf”; Europe today]
456
How we date Beowulf Some Important Dates:
521 A.D. – death of Hygelac, who is mentioned in the poem
680 A.D. – appearance of alliterative verse
835 A.D. – the Danes started raiding other areas; after this, few poets would consider them heroes
SO: This version was likely composed between 680 and 835, though it may be set earlier
457
The Poetry in Beowulf 1. Alliterative verse
A few things to watch out for
1. Alliterative verse
a. Repetition of initial sounds of words (occurs in every line)
b. Generally, four feet/beats per line
c. A caesura, or pause, between beats two and four
d. No rhyme
458
The Poetry in Beowulf Alliterative verse – an example from Beowulf:
A few things to watch out for
Alliterative verse – an example from Beowulf:
Oft Scyld Scefing sceaþena þreatum,
monegum mægþum meodo-setla ofteah;
egsode eorle, syððan ærest wearð.
459
The Poetry in Beowulf A few things to watch out for
The same lines in a modern English translation:
There was Shield Sheafson, scourge of many tribes,
A wrecker of mead-benches, rampaging among foes.
The terror of the hall-troops had come far.
460
The Poetry in Beowulf 2. Kennings
A few things to watch out for
2. Kennings
a. Compound metaphor (usually two words)
b. Most were probably used over and over
For instance: hronrade literally means “whale-road,” but can be translated as “sea”
461
The Poetry in Beowulf Other kennings from Beowulf:
A few things to watch out for
Other kennings from Beowulf:
banhus = “bone-house” = body
goldwine gumena = “gold-friend of men” = generous prince
beaga brytta = “ring-giver” = lord
beadoleoma = “flashing light” = sword
462
The Poetry in Beowulf 3. Litotes
A few things to watch out for
3. Litotes
A negative expression; usually an understatement
Example: Hildeburh had no cause to praise the Jutes
In this example, Hildeburh’s brother has just been killed by the Jutes. This is a poetic way of telling us she hated the Jutes absolutely.
463
Some terms you’ll want to know
scop A bard or story-teller. The scop was responsible for praising deeds of past heroes, for recording history, and for providing entertainment
464
Some terms you’ll want to know
comitatus Literally, this means “escort” or “comrade.” This term identifies the concept of warriors and lords mutually pledging their loyalty to one another.
465
Some terms you’ll want to know
thane A warrior
mead-hall The large hall where the lord and his warriors slept, ate, held ceremonies, etc.
466
Some terms you’ll want to know
wyrd Fate. This idea crops up a lot in the poem, while at the same time there are Christian references to God’s will.
467
Some terms you’ll want to know
epic Beowulf is an epic poem. This means it has a larger-than-life hero and the conflict is of universal importance. There’s a certain seriousness that accompanies most epics.
468
Some terms you’ll want to know
elegy An elegy is a poem that is sad or mournful. The adjective is elegiac.
homily A homily is a written sermon or a section of the poem that gives direct advice.
469
Themes and Important Aspects
Good vs. Evil
Religion: Christian and Pagan influences
The importance of wealth and treasure
The importance of the sea and sailing
The sanctity of the home
Fate
Loyalty and allegiance
Heroism and heroic deeds