

Early Experience with Out-of-Core Applications on the Cray XMT
Daniel Chavarría-Miranda§, Andrés Márquez§, Jarek Nieplocha§, Kristyn Maschhoff† and Chad Scherrer§
§ Pacific Northwest National Laboratory (PNNL)
† Cray, Inc.

Introduction

- Increasing gap between memory and processor speed is causing many applications to become memory-bound
- Mainstream processors rely on a cache hierarchy, but caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative: switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy

Cray XMT

- 3rd-generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors: SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors

Cray XMT (cont.)

- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache; instead, a 128 KB, 4-way associative data buffer on the memory side
- Extra bits in each 64-bit memory word: full/empty bits for synchronization
- Memory is hashed at a 64-byte granularity, i.e. contiguous logical addresses at 64-byte boundaries may be mapped to noncontiguous physical locations
- Global shared memory
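The 64-byte hashing can be illustrated with a small sketch. The XMT's actual address-scrambling function is not shown in the deck; the mixer below (a splitmix64 step) is a stand-in chosen only to show the idea: addresses are split into 64-byte blocks, offsets within a block are preserved, and block indices are scrambled across physical memory.

```c
#include <stdint.h>

#define BLOCK_BYTES 64

/* Illustrative only: the real XMT scrambling function is not public.
 * Any good integer mixer demonstrates the principle. */
static uint64_t scramble(uint64_t block) {
    block += 0x9E3779B97F4A7C15ULL;
    block = (block ^ (block >> 30)) * 0xBF58476D1CE4E5B9ULL;
    block = (block ^ (block >> 27)) * 0x94D049BB133111EBULL;
    return block ^ (block >> 31);
}

/* Map a logical address to a pseudo-physical block index. Addresses within
 * the same 64-byte block stay together; consecutive blocks are scattered. */
uint64_t physical_block(uint64_t logical_addr, uint64_t num_blocks) {
    return scramble(logical_addr / BLOCK_BYTES) % num_blocks;
}

/* The offset within a block is unchanged by the hashing. */
uint64_t block_offset(uint64_t logical_addr) {
    return logical_addr % BLOCK_BYTES;
}
```

The payoff is that unit-stride and irregular accesses alike spread evenly over the memory controllers, which matters more than spatial locality on a cacheless machine.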

Cray XMT (cont.)

- Lightweight User Communication library (LUC) coordinates data transfers and hybrid execution between ThreadStorm and Opteron processors
  - Portals-based on the Opterons, Fast I/O API-based on the ThreadStorms
  - RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system
- ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system
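The deck does not show the LUC API itself, so the following is a hypothetical sketch of the RPC-style flow (all names are invented, not real LUC entry points): the ThreadStorm side issues a read request, and an Opteron-side handler services it against the Lustre-resident file, modeled here as an in-memory buffer.

```c
#include <string.h>

/* Stand-in for a Lustre-resident file visible only to the SIO node. */
typedef struct { const char *data; long len; } FileImage;

/* Opteron/SIO side: service one read request against the file. */
long serve_read(const FileImage *f, long offset, long n, char *out) {
    if (offset >= f->len) return 0;
    if (offset + n > f->len) n = f->len - offset;   /* clamp at EOF */
    memcpy(out, f->data + offset, n);
    return n;                                        /* bytes actually read */
}

/* ThreadStorm side: "remote" call. A real client would marshal the request
 * over the SeaStar network and block on the reply; here the handler is
 * invoked directly to keep the sketch self-contained. */
long luc_style_read(const FileImage *server_file, long offset, long n, char *buf) {
    return serve_read(server_file, offset, n, buf);
}
```

The RPC-style semantics mean the ThreadStorm caller sees an ordinary blocking read, while the actual Lustre access happens on the node that can reach the file system.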

Outline

- Introduction
- Cray XMT
- → PDTree
- Multithreaded implementation
  - Static & dynamic versions
- Experimental setup and results
- Conclusions

PDTree (or Anomaly Detection for Categorical Data)

- Originates from cyber-security analysis: detect anomalies in packet headers to locate and characterize network attacks
- The analysis method is more widely applicable
- Uses ideas from conditional probability and multivariate categorical data analysis
  - For a combination of variables and instances of values for those variables, find out how many times the pattern has occurred
  - The resulting count table or contingency table specifies a joint distribution
- Efficient implementations of algorithms using such tables are very important in statistical analysis
- The ADTree data structure (Moore & Lee 1998) can be used to store data counts; it stores all combinations of values for the variables

PDTree (cont.)

- We use an enhancement of the ADTree data structure, called a PDTree, that does not need to store all possible combinations of values
- Only a priori specified combinations are stored
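The count tables can be illustrated with a toy sketch (the variable pair and value range are invented for illustration; a real PDTree holds many such tables, one per specified variable combination): for one pair of categorical variables, count how often each combination of values occurs in a record stream.

```c
#include <string.h>

#define NVALS 4   /* assume each variable takes values 0..NVALS-1 */

/* Contingency table over two categorical variables: count[a][b] is the
 * number of records seen with (var1 = a, var2 = b). */
typedef struct { int count[NVALS][NVALS]; } ContingencyTable;

void ct_init(ContingencyTable *t) {
    memset(t, 0, sizeof *t);
}

/* Insert one record's (var1, var2) instance. */
void ct_insert(ContingencyTable *t, int a, int b) {
    t->count[a][b]++;
}

/* How many times has the pattern (var1 = a, var2 = b) occurred? */
int ct_query(const ContingencyTable *t, int a, int b) {
    return t->count[a][b];
}
```

Normalizing the counts by the total number of records turns the table into the joint distribution the slide mentions; anomaly detection flags patterns whose observed counts are improbable under that distribution.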

Multithreaded Implementation

- PDTree implemented as a multiple-type, recursive tree structure
  - The root node is an array of ValueNodes (counts for the different value instances of the root variables)
  - Interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
  - The XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode
  - If the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky
  - To ensure safety, the end pointer of the list must be locked
  - readfe() and writeef() MTA operations create critical sections, taking advantage of the full/empty bits on each memory word
- As the data analysis progresses, the probability of conflicts between threads decreases

Multithreaded Implementation (cont.)

[Figure: two threads race on a ValueNode list holding nodes vi = j and vi = k. T1 and T2 both try to grab the locked end pointer; T1 succeeds and inserts a new node vi = m, so T2 ends up holding a lock on a pointer that is no longer the end of the list and must continue its traversal.]
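A portable sketch of this insertion logic, with C11 atomic_fetch_add standing in for the XMT's int_fetch_add() and a mutex on the list tail standing in for the readfe()/writeef() full/empty protocol (structure and field names are illustrative, not the paper's code; a production version would also make the next pointers atomic for lock-free traversal):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct ValueNode {
    int value;                   /* value instance this node counts */
    atomic_int count;            /* incremented without any list lock */
    struct ValueNode *next;
} ValueNode;

typedef struct {
    ValueNode *head;
    pthread_mutex_t tail_lock;   /* plays the role of the full/empty bit
                                    on the list's end pointer */
} ValueList;

void list_insert(ValueList *l, int value) {
    /* First pass: look for an existing node; counter updates are atomic,
     * so no lock is needed on this (common) path. */
    for (ValueNode *n = l->head; n; n = n->next)
        if (n->value == value) { atomic_fetch_add(&n->count, 1); return; }

    /* Not found: lock the tail, re-check any nodes appended since our
     * traversal (another thread may have inserted the same value, as in
     * the figure above), then append a fresh node. */
    pthread_mutex_lock(&l->tail_lock);
    ValueNode **tail = &l->head;
    for (; *tail; tail = &(*tail)->next)
        if ((*tail)->value == value) {
            atomic_fetch_add(&(*tail)->count, 1);
            pthread_mutex_unlock(&l->tail_lock);
            return;
        }
    ValueNode *n = malloc(sizeof *n);
    n->value = value;
    atomic_init(&n->count, 1);
    n->next = NULL;
    *tail = n;
    pthread_mutex_unlock(&l->tail_lock);
}
```

On the XMT the same shape falls out naturally: readfe() empties the end pointer's full/empty bit (excluding other writers) and writeef() refills it, so only the append path pays for synchronization.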

Static and Dynamic Versions

[Figure: the tree is built from an array of RootNodes (each with a count, e.g. count = 5) pointing to arrays of ColumnNodes for columns a, b, c (each with numCols = 3 and a values pointer). In the static version the values at each level are linked lists of ValueNodes (e.g. value = 10, count = 1 → value = 19, count = 4, chained via nextVN); in the dynamic version the linked lists are replaced by hash tables of ValueNodes.]
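The dynamic version's hash table of ValueNodes can be sketched as follows (the table size matches the 128K-element configuration used in the experiments; the hash function and field names are illustrative, and the sketch omits the concurrency machinery shown earlier):

```c
#include <stdlib.h>

#define TABLE_SIZE (128 * 1024)   /* 128K slots, as in the experiments */

typedef struct HNode {
    int value;
    int count;
    struct HNode *next;           /* chain for values hashing to one slot */
} HNode;

typedef struct { HNode *slot[TABLE_SIZE]; } ValueTable;

/* Knuth-style multiplicative hash; any well-mixing hash would do. */
static unsigned slot_of(int value) {
    return ((unsigned)value * 2654435761u) % TABLE_SIZE;
}

void table_insert(ValueTable *t, int value) {
    HNode **p = &t->slot[slot_of(value)];
    for (; *p; p = &(*p)->next)
        if ((*p)->value == value) { (*p)->count++; return; }
    HNode *n = malloc(sizeof *n);    /* value not seen before: new node */
    n->value = value;
    n->count = 1;
    n->next = NULL;
    *p = n;
}

int table_count(const ValueTable *t, int value) {
    for (const HNode *n = t->slot[slot_of(value)]; n; n = n->next)
        if (n->value == value) return n->count;
    return 0;
}
```

The trade-off versus the static linked lists is the usual one: lookups stay near O(1) as the number of distinct values grows, at the cost of the fixed table memory per tree level.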

Outline

- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
  - Static & dynamic versions
- → Experimental setup and results
- Conclusions

Experimental Setup and Results

- Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS: not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis
  - Needs the dynamic PDTree with a 128K-element hash table

Experimental Setup and Results (cont.)

[Figure: a compute node holding 4 ThreadStorm CPUs with DRAM and a service/login node holding an Opteron CPU with DRAM, connected by the SeaStar interconnect. The ThreadStorms reach the Lustre filesystem either by direct access or by indirect access via LUC RPC through the service node.]

Note: results were obtained on a preproduction XMT with only half of the DIMM slots populated.

Experimental Setup and Results (cont.)

[Table: in-core, 1M-record execution, static PDTree version. Columns: # of procs., XMT insertion time, XMT speedup, MTA insertion time, MTA speedup (speedup N/A for one processor); the numeric values were not captured in this transcript.]

Experimental Setup and Results (cont.)

[Figure: results chart not captured in the transcript.]

Experimental Setup and Results (cont.)

[Figure: results chart not captured in the transcript.]

Conclusions

- The results indicate the value of the XMT hybrid architecture and its improved I/O capabilities: indirect access to Lustre through the LUC interface
- The I/O operation implementation needs further work to take full advantage of Lustre; multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems