P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES
Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun
Institute of Computing Technology

Presentation transcript:

P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun Institute of Computing Technology PADS 2010, May 18, 2010

Motivation: Multi-core platforms are common now (e.g. Sun UltraSPARC T2, AMD Opteron 6000 / Phenom, Intel Nehalem; chip images courtesy of the vendors), yet system simulators are still sequential. As a result, the host multi-core is wasted and simulation speed is limited by single-core performance.

Poor Scalability of a Single-threaded Simulator: the slowdown grows exponentially with the number of simulated cores, making the simulator too slow to simulate future many-core systems. (Figure: simulation slowdown vs. number of simulated cores.)

Goal: fast and accurate computer system simulation. Target: a 10x speedup without losing accuracy. (Figure: simulator accuracy, functional vs. cycle-accurate, plotted against speed/slowdown.)

Outline: Motivation; Implementation (Background, From DES to PDES, Optimization); Evaluation; Conclusion.

Godson-T Architecture Simulator: a fine-grained Discrete Event Simulation (DES). There is one global event queue; each event is assigned to a sinker (the module that consumes it), and any new events it generates are inserted back into the queue.
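The slides do not show code; as a rough illustration of the scheme just described (one global queue, events handed to sinker modules, new events re-inserted), a minimal C++ sketch might look like this, with all type and function names hypothetical:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct Event {
    uint64_t time;    // simulated cycle at which the event fires
    int      sinker;  // index of the module (core/cache/router) that consumes it
    // payload omitted
};

// Orders the min-heap by simulated time.
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

struct Module {
    // Handling one event may generate new events for other modules.
    virtual std::vector<Event> handle(const Event& e) = 0;
    virtual ~Module() = default;
};

// Classic sequential DES loop: one global queue drives every module.
void run_des(std::vector<Module*>& modules,
             std::priority_queue<Event, std::vector<Event>, Later>& q,
             uint64_t end_time) {
    while (!q.empty() && q.top().time < end_time) {
        Event e = q.top();
        q.pop();
        for (const Event& ne : modules[e.sinker]->handle(e))
            q.push(ne);  // new events go back into the same global queue
    }
}
```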

SimK: PDES Framework. Open source; conservative PDES; highly optimized (pthreads, lock-free data structures, user-level thread scheduling); modularized: the user implements each LP against the SimK API, while SimK provides schedule, exec, commu, sync, buffer and deploy services. (Figure: LPs running on top of SimK on the host cores.)

From DES to PDES: separate the global queue; group sinkers into logical processes (LPs), one queue per LP; an event that crosses LPs is wrapped with a PDES timestamp. (Figure: an LP containing a router, core and cache, with a PDES time wrapper at its boundary.) A minimal sketch of this structure follows.
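This sketch uses hypothetical names; the actual GAS/SimK data structures are not shown in the slides. The point is the split into per-LP queues plus a timestamped wrapper for cross-LP traffic:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Plain simulator event, as in the sequential version.
struct Event { uint64_t time; int target_module; /* payload omitted */ };

// "PDES time wrapper": a cross-LP event carries an explicit timestamp so the
// receiving LP can merge it into its own queue in time order.
struct TimedEvent {
    uint64_t timestamp;  // simulated cycle at which it must be processed
    Event    inner;
};

// Each LP (here: one router + core + cache) owns a private queue plus a
// locked inbox for events arriving from other LPs.
struct LP {
    std::vector<Event>      local_queue;  // stand-in for the per-LP event heap
    std::mutex              inbox_lock;
    std::vector<TimedEvent> inbox;
};

// Sending across an LP boundary wraps the event with PDES time.
void send_remote(LP& dst, const Event& e, uint64_t now, uint64_t link_delay) {
    std::lock_guard<std::mutex> g(dst.inbox_lock);
    dst.inbox.push_back({now + link_delay, e});
}
```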

Example: a router event in which router 0 sends an event to router 1. Before parallelization, both routers post to the single global event queue; after, router 0 and router 1 belong to separate LPs (LP 0 and LP 1), each with its own queue, and the event crossing between them goes through the PDES time wrapper.

Events from DES to PDES: the single-threaded event loop becomes multiple threads running a conservative PDES. (Figure: simulation time across threads 1-4, with event dependences forcing synchronization every cycle because the lookahead is only 1 cycle.)
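A simplified sketch of the conservative synchronization this implies, using hypothetical names and assuming every LP may send to every other LP (this is not SimK's actual mechanism): each thread advances its LP only up to the minimum neighbor clock plus the 1-cycle lookahead.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Conservative PDES with a 1-cycle lookahead: an LP may execute events at
// cycle t only once every LP that could send to it has advanced far enough
// that no event time-stamped earlier than t can still arrive.
constexpr uint64_t kLookahead = 1;

struct LPClock { std::atomic<uint64_t> now{0}; };

// Exclusive upper bound on the cycles LP `self` may safely process.
uint64_t safe_bound(const std::vector<LPClock>& clocks, size_t self) {
    uint64_t lb = UINT64_MAX;
    for (size_t i = 0; i < clocks.size(); ++i) {
        if (i == self) continue;
        lb = std::min(lb, clocks[i].now.load(std::memory_order_acquire));
    }
    return lb == UINT64_MAX ? lb : lb + kLookahead;
}

// Per-thread driver: execute local events strictly below the safe bound,
// publish the new local clock, and otherwise wait for neighbors to advance.
void advance(std::vector<LPClock>& clocks, size_t self, uint64_t end_cycle) {
    while (clocks[self].now.load() < end_cycle) {
        uint64_t bound = std::min(safe_bound(clocks, self), end_cycle);
        if (clocks[self].now.load() < bound) {
            // ... process all events of LP `self` with time < bound ...
            clocks[self].now.store(bound, std::memory_order_release);
        }
        // else: spin until some other LP advances
    }
}
```

With a 1-cycle lookahead the safe bound moves forward by roughly one cycle per round, which is why the synchronization cost dominates when there are many small LPs.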

Grouping Into Big LPs. Problem: average speedup is only 1.8x with 16 threads (prototype with 16 one-core LPs). Cause: too many LPs combined with an extremely small lookahead leads to high synchronization cost. Solution: group adjacent LPs into one big LP (sketched below).
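A hypothetical sketch of the grouping step, assuming adjacent simulated cores are simply folded into contiguous big LPs (the slides do not specify the exact mapping):

```cpp
#include <vector>

// Map each simulated core to a big LP. Adjacent cores, which exchange the
// most events, share an LP, so their interactions become cheap local queue
// operations instead of cross-LP synchronization.
std::vector<int> group_cores(int num_cores, int num_big_lps) {
    std::vector<int> lp_of_core(num_cores);
    int per_lp = (num_cores + num_big_lps - 1) / num_big_lps;  // ceiling division
    for (int c = 0; c < num_cores; ++c)
        lp_of_core[c] = c / per_lp;  // contiguous cores land in the same big LP
    return lp_of_core;
}
```

For example, with 64 simulated cores and 16 threads, cores 0-3 would share big LP 0, keeping their frequent core/cache/router interactions inside one queue.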

Final Parallelized Version: parallel discrete event simulation with sinkers grouped into big LPs; LPs bound to threads using the SimK API; time synchronization between LPs using PDES; scheduling and execution under the SimK framework (schedule, exec, commu, sync, buffer, deploy). (Figure: big LPs mapped onto the host cores under SimK.)
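The binding of LPs to threads goes through the SimK API, which is not shown in the slides; a generic Linux/glibc pthreads sketch of the same idea (one worker thread per big LP, pinned to a host core) might look like this:

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin one worker thread (running one big LP) to a specific host core so the
// LP's working set stays in that core's caches.
void pin_to_core(pthread_t thread, int host_core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(host_core, &set);
    pthread_setaffinity_np(thread, sizeof(set), &set);
}
```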

Outline: Motivation; Implementation; Evaluation (Accuracy, Speedup); Conclusion.

Evaluation Setup: GAS vs. P-GAS on a 4-socket quad-core AMD Opteron 8347 SMP (16 cores total, 64 GB memory). Benchmark: SPLASH-2 kernels; we count benchmark computing time in wall-clock time.

Cycle Count Error: average cycle count error is 0.04%. (Figure: per-benchmark cycle count error with 16 threads.)

P-GAS Speedup (16 threads, SPLASH-2 kernels): average speedup is 9.8x; best speedup is 13.6x (LU, 16 threads); super-linear speedup of 5.3x with 4 threads.

Why super-linear speedup? More host cores mean more caches to use, and the insert-to-queue time is shorter.
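One plausible reading of the shorter insert time, stated here as an assumption rather than something the slide spells out: if the event queue is a binary heap and parallelization splits one shared queue of N pending events into 16 per-LP queues, each insert touches fewer heap levels:

$$ \log_2 N \;\rightarrow\; \log_2(N/16) = \log_2 N - 4 \quad \text{comparisons per insert.} $$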

Conclusion: P-GAS uses PDES to speed up a cycle-accurate many-core processor simulator, achieving a 9.8x speedup on a 16-core SMP with a cycle count error below 0.04%. Highly optimized conservative PDES can be used for fast and accurate system simulation: multi-core/many-core processor simulation, SMP clusters, many-core clusters, and more.

P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES. Please e-mail me your questions. Open source release of our PDES framework: