SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Building and Running Parallel Simulations.


Building and Running Parallel Simulations

  – Initialization: configuration parameters
  – Instantiate Components: from the Manifold library; inputs (trace, QSIM, etc.)
  – Connect Components: instantiate links
  – Register Clocks: set timing behavior (time stepped vs. discrete event)
  – Simulation Functions: set duration, cleanup, etc.

Building and Running Parallel Simulations

  – Kernel Interface
  – Simulator Construction
  – Logs and Statistics
  – Demos

Kernel Interface: Component Functions

Create component
  – a component can have 0-4 constructor arguments
  – the template allows constructor parameters to be of any type
  – returns a unique integer ID

    // T stands for the component class
    Component::Create<T>(lp, node_id, m_conf, cpuid, proc_settings);

    // component-decl.h
    template <typename T>
    static CompId_t Create(LpId_t, CompName name = CompName("none"));
    ...
    template <typename T, typename T1, typename T2, typename T3, typename T4>
    static CompId_t Create(LpId_t, const T1&, const T2&, const T3&, const T4&,
                           CompName name = CompName("none"));
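The idea behind Create() can be illustrated with a small self-contained sketch. Everything in namespace sketch is a hypothetical stand-in, not the Manifold kernel's code: the registry, the Get() helper, and the variadic template (used here in place of the fixed 0-4 argument overloads) are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace sketch {  // hypothetical stand-ins, not the Manifold kernel

using CompId_t = int;
using LpId_t = int;

class Component {
public:
    virtual ~Component() = default;

    // Build a T from arbitrary constructor arguments, remember which
    // logical process (LP) owns it, and hand back a unique integer ID.
    template <typename T, typename... Args>
    static CompId_t Create(LpId_t lp, Args&&... args) {
        std::unique_ptr<Component> comp =
            std::make_unique<T>(std::forward<Args>(args)...);
        comp->lp_ = lp;
        const CompId_t id = next_id()++;
        registry()[id] = std::move(comp);
        return id;
    }

    // Look a component up by ID; returns nullptr on unknown ID or wrong type.
    template <typename T>
    static T* Get(CompId_t id) {
        auto it = registry().find(id);
        return it == registry().end() ? nullptr
                                      : dynamic_cast<T*>(it->second.get());
    }

    LpId_t lp() const { return lp_; }

private:
    static CompId_t& next_id() { static CompId_t n = 0; return n; }
    static std::map<CompId_t, std::unique_ptr<Component>>& registry() {
        static std::map<CompId_t, std::unique_ptr<Component>> r;
        return r;
    }
    LpId_t lp_ = 0;
};

// Example component with two constructor arguments of different types.
struct Core : Component {
    Core(int node_id, std::string conf)
        : node_id(node_id), conf(std::move(conf)) {}
    int node_id;
    std::string conf;
};

// Usage: two components on different LPs receive distinct IDs.
inline void demo() {
    CompId_t a = Component::Create<Core>(0, 3, "cfg_a");
    CompId_t b = Component::Create<Core>(1, 4, "cfg_b");
    assert(a != b);
    assert(Component::Get<Core>(b)->node_id == 4);
}

}  // namespace sketch
```

Returning an ID rather than a pointer lets the rest of the program refer to a component without knowing which LP actually holds the object.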

Kernel Interface: Connect Components

One-way connection (source component, output srcIdx, to destination component, input dstIdx)

    // manifold-decl.h
    template <typename T, typename T2>
    static void Connect(CompId_t srcComp, int srcIdx,
                        CompId_t dstComp, int dstIdx,
                        void (T::*handler)(int, T2), Ticks_t latency);

Two-way connection

    // manifold-decl.h
    template <typename T, typename T2, typename U, typename U2>
    static void Connect(CompId_t comp1, int idx1, void (T::*handler1)(int, T2),
                        CompId_t comp2, int idx2, void (U::*handler2)(int, U2),
                        Clock& clk1, Clock& clk2,
                        Ticks_t latency1, Ticks_t latency2);
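A one-way connection can be pictured as "data sent now is handed to a member-function handler on the destination after a fixed latency". The sketch below shows that idea only. The Scheduler class, the Cache receiver, and the Connect() signature (which returns a callable acting as the source's output port) are assumptions for illustration and differ from the Manifold declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

namespace sketch {  // hypothetical stand-ins, not the Manifold kernel

using Ticks_t = std::uint64_t;

// Minimal event queue: callbacks ordered by delivery tick.
class Scheduler {
public:
    void schedule(Ticks_t when, std::function<void()> fn) {
        events_.emplace(when, std::move(fn));
    }
    void run() {
        while (!events_.empty()) {
            auto it = events_.begin();
            now_ = it->first;
            auto fn = std::move(it->second);
            events_.erase(it);  // erase first so fn may schedule new events
            fn();
        }
    }
    Ticks_t now() const { return now_; }

private:
    std::multimap<Ticks_t, std::function<void()>> events_;
    Ticks_t now_ = 0;
};

// One-way link: the returned callable is the source's output port.
// Calling it schedules handler(dstIdx, data) on dst after `latency` ticks.
template <typename T, typename T2>
std::function<void(T2)> Connect(Scheduler& sched, T* dst, int dstIdx,
                                void (T::*handler)(int, T2), Ticks_t latency) {
    return [&sched, dst, dstIdx, handler, latency](T2 data) {
        sched.schedule(sched.now() + latency, [dst, dstIdx, handler, data] {
            (dst->*handler)(dstIdx, data);
        });
    };
}

// Example destination: records what arrived, on which input, and when.
struct Cache {
    explicit Cache(const Scheduler& s) : sched(s) {}
    void handle_request(int port, int addr) {
        last_port = port;
        last_addr = addr;
        arrival = sched.now();
    }
    const Scheduler& sched;
    int last_port = -1;
    int last_addr = -1;
    Ticks_t arrival = 0;
};

// Usage: a request sent at tick 0 over a 5-tick link arrives at tick 5.
inline void demo() {
    Scheduler s;
    Cache cache(s);
    auto out = Connect(s, &cache, 2, &Cache::handle_request, Ticks_t{5});
    out(0x40);
    s.run();
    assert(cache.last_port == 2 && cache.arrival == 5);
}

}  // namespace sketch
```

In this sketch a two-way connection is simply two such links, one in each direction, each with its own handler and latency.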

Kernel Interface: Clock and Simulation Functions

Clock functions: constructor, Register()

    // clock.h
    Clock(double freq);

    template <typename O>
    static tickObjBase* Register(Clock& clk, O* obj,
                                 void (O::*rising)(void),
                                 void (O::*falling)(void));

Simulation functions

    // manifold-decl.h
    static void Init(int argc, char** argv,
                     SchedulerType = TICKED,
                     SyncAlg::SyncAlgType_t syncAlg = SyncAlg::SA_CMB_OPT_TICK,
                     Lookahead::LookaheadType_t la = Lookahead::LA_GLOBAL);
    static void Finalize();
    static void StopAt(Ticks_t stop);
    static void Run();
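Clock registration can be sketched as follows: an object hands the clock two member functions, one called on each rising edge and one on each falling edge. The Clock class below is a simplified stand-in written for this illustration. Its Run(), NowTicks() and NowSeconds() members are assumptions, and unlike the declaration above its Register() returns nothing.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

namespace sketch {  // hypothetical stand-ins, not the Manifold kernel

using Ticks_t = std::uint64_t;

class Clock {
public:
    explicit Clock(double freq) : freq_(freq) {}

    // Attach obj to clk: `rising` runs on every rising edge,
    // `falling` on every falling edge.
    template <typename O>
    static void Register(Clock& clk, O* obj,
                         void (O::*rising)(), void (O::*falling)()) {
        clk.rising_.push_back([obj, rising] { (obj->*rising)(); });
        clk.falling_.push_back([obj, falling] { (obj->*falling)(); });
    }

    // Advance n full cycles: all rising handlers, then all falling handlers.
    void Run(Ticks_t n) {
        for (Ticks_t i = 0; i < n; ++i) {
            for (auto& f : rising_) f();
            for (auto& f : falling_) f();
            ++now_;
        }
    }

    Ticks_t NowTicks() const { return now_; }
    double NowSeconds() const { return static_cast<double>(now_) / freq_; }

private:
    double freq_;  // cycles per second
    Ticks_t now_ = 0;
    std::vector<std::function<void()>> rising_;
    std::vector<std::function<void()>> falling_;
};

// Example clocked object: produce on the rising edge, consume on the falling.
struct Core {
    void fetch() { ++fetched; }
    void commit() { committed = fetched; }
    Ticks_t fetched = 0;
    Ticks_t committed = 0;
};

// Usage: after 4 cycles of a 2 Hz clock, 2 seconds of simulated time passed.
inline void demo() {
    Clock clk(2.0);
    Core core;
    Clock::Register(clk, &core, &Core::fetch, &Core::commit);
    clk.Run(4);
    assert(core.fetched == 4 && core.committed == 4);
    assert(clk.NowSeconds() == 2.0);
}

}  // namespace sketch
```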

Simulator Construction

Steps for building a simulation program
  – Call Manifold::Init()
  – Build system model: Clock(); Create(), Connect(), Register()
  – Set simulation stop time: StopAt()
  – Call Manifold::Run()
  – Call Manifold::Finalize()
  – Print out statistics: print_stats()
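The order of these steps can be sketched with a toy time-stepped kernel. Kernel here is a hypothetical stand-in that only mimics the call sequence; its Init() takes no arguments and its Register() takes a plain callback, neither of which matches the Manifold signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical stand-ins, not the Manifold kernel

using Ticks_t = std::uint64_t;

// Toy time-stepped kernel that mirrors the Init/StopAt/Run/Finalize order.
class Kernel {
public:
    static void Init() { state() = State{}; }
    static void Register(std::function<void()> on_tick) {
        state().tickers.push_back(std::move(on_tick));
    }
    static void StopAt(Ticks_t stop) { state().stop = stop; }
    static void Run() {
        State& s = state();
        for (; s.now < s.stop; ++s.now)
            for (auto& tick : s.tickers) tick();
    }
    static void Finalize() { state().tickers.clear(); }

private:
    struct State {
        std::vector<std::function<void()>> tickers;
        Ticks_t stop = 0;
        Ticks_t now = 0;
    };
    static State& state() { static State s; return s; }
};

// Example component that counts the cycles it was ticked.
struct Core {
    void tick() { ++cycles; }
    void print_stats(std::ostream& out) const {
        out << "cycles: " << cycles << "\n";
    }
    Ticks_t cycles = 0;
};

// The six steps, in order; returns the statistics text.
inline std::string run_simulation(Ticks_t stop) {
    Kernel::Init();                               // 1. initialize the kernel
    Core core;                                    // 2. build the system model
    Kernel::Register([&core] { core.tick(); });
    Kernel::StopAt(stop);                         // 3. set the stop time
    Kernel::Run();                                // 4. run
    Kernel::Finalize();                           // 5. clean up
    std::ostringstream out;                       // 6. print statistics
    core.print_stats(out);
    return out.str();
}

}  // namespace sketch
```

Calling sketch::run_simulation(100) returns the text "cycles: 100" followed by a newline, since the single core is ticked once per simulated cycle.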

Logs and Statistics

  – Each component collects its own statistics
  – A convention for printing stats is:

        void print_stats(std::ostream&);
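One benefit of a shared print_stats(std::ostream&) convention is that unrelated component types can be reported by the same generic code, to a file, a string, or the console. A minimal sketch, assuming two made-up components (Cache and Router) and a made-up helper print_all_stats(); none of these names come from Manifold.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>

namespace sketch {  // hypothetical components, for illustration only

struct Cache {
    void access(bool hit) {
        if (hit) ++hits; else ++misses;
    }
    void print_stats(std::ostream& out) const {
        out << "cache.hits " << hits << "\n"
            << "cache.misses " << misses << "\n";
    }
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
};

struct Router {
    void forward() { ++packets; }
    void print_stats(std::ostream& out) const {
        out << "router.packets " << packets << "\n";
    }
    std::uint64_t packets = 0;
};

// Works for any mix of types that follow the print_stats convention;
// components are printed in argument order.
template <typename... Comps>
void print_all_stats(std::ostream& out, const Comps&... comps) {
    int expand[] = {0, (comps.print_stats(out), 0)...};
    (void)expand;
}

// Usage: collect during the run, print once at the end.
inline void demo() {
    Cache cache;
    Router router;
    cache.access(true);
    cache.access(false);
    router.forward();
    std::ostringstream out;
    print_all_stats(out, cache, router);
    assert(out.str() ==
           "cache.hits 1\ncache.misses 1\nrouter.packets 1\n");
}

}  // namespace sketch
```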

Example Simulators

Simulator 1: for demo purposes only
  – Builds a 2-core system: 2 Zesto cores, MCP cache, Iris (2x2 torus), CaffDRAM
  – Runs sequential or parallel (3 LPs) simulation

Simulator 2: part of the software distribution
  – 3 programs: work with the Qsim server, the Qsim lib, and traces, respectively
  – Core model can be replaced with a one-line change to the configuration file

Sample Results: Setup

  – 16-, 32-, 64-core CMP models
  – 2, 4, 8 memory controllers, respectively
  – 5x4, 6x6, 9x8 torus, respectively
  – Host: Linux cluster; each node has 2 Intel Xeon X core CPUs with 24 h/w threads
  – 13, 22, 40 h/w threads used by the simulator on 1, 2, 3 nodes, respectively
  – 200 million simulated cycles in region of interest (ROI)
  – Saved boot state and fast forward to ROI

Sample Results: Simulation Time in Minutes

Sequential (Seq.) vs. parallel (Para.) simulation for the 16-, 32- and 64-core models. Speedup of parallel over sequential simulation, given in parentheses in the Para. columns:

    Benchmark    16-core   32-core   64-core
    dedup        4.4X      7.1X      6.7X
    facesim      5.4X      8.6X      9.3X
    ferret       4.9X      7.0X      7.6X
    freqmine     5.5X      6.7X      8.1X
    stream       5.3X      7.0X      12.1X
    vips         5.1X      6.7X      7.6X
    barnes       4.6X      6.0X      11.1X
    cholesky     5.2X      6.5X      10.6X
    fmm          5.0X      6.7X      12.1X
    lu           5.6X      7.2X      11.3X
    radiosity    4.5X      6.3X      8.0X
    water        4.2X      5.9X      7.2X

Sample Results: Simulation in KIPS

[Table: sequential (Seq.) and parallel (Para.) simulation speed in KIPS for the 16-, 32- and 64-core models; rows dedup, facesim, ferret, freqmine, stream, vips, barnes, cholesky, fmm, lu, radiosity, water, plus mean and median.]

Sample Results: KIPS per Hardware Thread

[Table: sequential (Seq.) and parallel (Para.) KIPS per hardware thread for the 16-, 32- and 64-core models; rows dedup, facesim, ferret, freqmine, stream, vips, barnes, cholesky, fmm, lu, radiosity, water, plus mean and median.]

Outline

  – Introduction
  – Execution Model and System Architecture
  – Multicore Emulator Front-End
  – Component Models
      – Cores
      – Network
      – Memory System
  – Building and Running Manifold Simulations
  – Physical Modeling: Energy Introspector
  – Some Example Simulators