Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
Sanjeev Kumar, Corporate Technology Group, Intel Corporation
Christopher J. Hughes, Corporate Technology Group, Intel Corporation
Anthony Nguyen, Corporate Technology Group, Intel Corporation
By Duygu AKMAN

2 Keywords  Multi-core Architectures  RMS applications  Large-Grain Parallelism / Fine-Grain Parallelism  Architectural Support

3 Multi-core Architectures  MCAs  Higher performance than uniprocessor systems  Reduced communication latency and increased bandwidth between cores  Applications need thread-level parallelism to benefit.

4 Thread Level Parallelism  One common approach is to partition a program into parallel tasks and let software schedule the tasks onto different threads.  Useful only if tasks are large enough that the software overhead is negligible (e.g., scientific applications).  RMS (Recognition, Mining and Synthesis) applications mostly have small tasks.
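A minimal sketch of what such a software scheduler looks like (my own illustration, not code from the paper): worker threads pull tasks from a shared, lock-protected queue, and every enqueue and dequeue pays a synchronization cost that is only negligible when the tasks themselves are large.

#include <algorithm>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

struct SoftwareTaskQueue {
    std::deque<std::function<void()>> tasks;
    std::mutex lock;

    void enqueue(std::function<void()> t) {
        std::lock_guard<std::mutex> g(lock);   // software overhead on every enqueue
        tasks.push_back(std::move(t));
    }
    std::optional<std::function<void()>> dequeue() {
        std::lock_guard<std::mutex> g(lock);   // software overhead on every dequeue
        if (tasks.empty()) return std::nullopt;
        auto t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

void worker(SoftwareTaskQueue& q) {
    while (auto t = q.dequeue())   // the overhead is paid once per task, so
        (*t)();                    // small tasks are dominated by it
}

int main() {
    SoftwareTaskQueue q;
    for (int i = 0; i < 32; ++i)
        q.enqueue([] { /* task body: only worthwhile if its work dwarfs the queue overhead */ });
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < n; ++c) pool.emplace_back(worker, std::ref(q));
    for (auto& th : pool) th.join();
}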

5 Reasons for Fine-Grain Parallelism  MCAs are now found even in home machines, which is another reason fine-grained parallelism is needed.  Some applications need good performance across platforms with varying numbers of cores -> fine-grain tasks  In multiprogramming, the number of cores assigned to an application can change during execution. The application needs to maximize its available parallelism -> fine-grain parallelism

6 Example (8-core MCA)

7  Two cases:  Application partitioned into 8 equal-sized tasks  Application partitioned into 32 equal-sized tasks  In a parallel section, when a core finishes its tasks, it waits for the other cores -> wasted resources

8 Example (8-core MCA)  With 4 or 8 cores assigned to the application, all cores are fully utilized.  With 6 cores, the first case wastes more resources than the second (it performs no better than with 4 cores); the second case is finer-grained.

9  The problem is even worse with a larger number of cores  With only 64 tasks, there is no performance improvement from 32 cores to 63 cores!  Need more tasks -> fine-grain parallelism (the small calculation below makes this concrete)
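A back-of-the-envelope calculation (mine, using the task counts from the example): with T equal-sized tasks on C cores, a parallel section takes ceil(T / C) rounds of task execution, so adding cores only helps while the task count keeps every core busy.

#include <cstdio>

// Rounds of execution for T equal-sized tasks on C cores: ceil(T / C).
int rounds(int tasks, int cores) { return (tasks + cores - 1) / cores; }

int main() {
    for (int cores : {4, 6, 8, 32, 63, 64}) {
        std::printf("cores=%2d   8 tasks -> %d rounds   32 tasks -> %d rounds   64 tasks -> %d rounds\n",
                    cores, rounds(8, cores), rounds(32, cores), rounds(64, cores));
    }
    // With 8 tasks, 6 cores need the same 2 rounds as 4 cores (two cores sit idle
    // in the second round). With 64 tasks, 63 cores still need 2 rounds, the same
    // as 32 cores: more tasks (finer granularity) are needed to use the extra cores.
}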

10 Contribution  Propose a hardware technique to accelerate dynamic task scheduling on MCAs.  Hardware queues that cache tasks and implement scheduling policies  Task prefetchers on each core to hide the latency of accessing the queues.

11 Workloads  Parallelized and analyzed RMS applications from areas including  Simulation for computer games  Financial analytics  Image processing  Computer vision, etc.  Some modules of these applications have large-grained parallelism -> insensitive to task queuing overhead  But a significant number of modules must be parallelized at a fine granularity to achieve good performance

12 Architectural Support For Fine-Grained Parallelism  Queuing of tasks handled by software incurs overhead  If tasks are small, this overhead can be a significant fraction of total execution time.  The contribution is adding hardware to MCAs to accelerate task queues.  Provides very fast access to the storage for tasks  Performs fast task scheduling

13 Proposed Hardware  An MCA chip where the cores are connected to a cache hierarchy by an on-die network.  Two separate hardware components  A Local Task Unit (LTU) per core  A single Global Task Unit (GTU)

14 Proposed Hardware

15 Global Task Unit  The GTU contains the logic that implements the scheduling algorithm  The GTU holds enqueued tasks in hardware queues, with one hardware queue per core  Since the queues are physically close to each other, scheduling is fast.  The GTU is physically centralized, and the cores reach it over the same on-die interconnect as the caches.
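A behavioral sketch of the GTU (my own approximation, not the paper's hardware design; in particular, the policy of taking the oldest task from another core's queue when the local queue is empty is an assumption):

#include <array>
#include <deque>
#include <optional>

using Task = unsigned long;                        // e.g. a packed task descriptor
constexpr int kCores = 64;

class GlobalTaskUnit {
    std::array<std::deque<Task>, kCores> queues;   // one hardware queue per core
public:
    void enqueue(int core, Task t) { queues[core].push_back(t); }

    std::optional<Task> dequeue(int core) {
        if (!queues[core].empty()) {               // newest task from the core's own
            Task t = queues[core].back();          // queue (LIFO for that thread)
            queues[core].pop_back();
            return t;
        }
        for (int i = 1; i < kCores; ++i) {         // local queue empty: take the
            int victim = (core + i) % kCores;      // oldest task from another core's
            if (!queues[victim].empty()) {         // queue (assumed policy)
                Task t = queues[victim].front();
                queues[victim].pop_front();
                return t;
            }
        }
        return std::nullopt;                       // no work anywhere in the GTU
    }
};

Because all of these queues sit in one place, even a dequeue that has to look at other cores' queues stays a local operation inside the GTU rather than a round trip across the chip.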

16 Global Task Unit  The disadvantage of the GTU is that as the number of cores increases, the average communication latency between a core and the GTU also increases.  This latency is hidden by the per-core prefetchers (LTUs).

17 Local Task Unit  Each core has a small piece of hardware to communicate with the GTU.  If a core waited to contact the GTU until the thread running on it finished its current task, the thread would have to stall for the GTU access latency.  The LTU therefore contains a task prefetcher and a small buffer to hide the latency of accessing the GTU.

18 Local Task Unit  On a dequeue, if there is a task in the LTU's buffer, that task is returned to the thread, and a prefetch for the next available task is sent to the GTU.  On an enqueue, the task is placed in the LTU's buffer. Since the proposed hardware uses a LIFO ordering of tasks for a given thread, if the buffer is already full, the oldest task in the buffer is sent to the GTU.
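Continuing the GlobalTaskUnit sketch above, a behavioral model of the per-core LTU (the buffer size and the exact prefetch timing are my assumptions; real hardware would overlap the prefetch with the current task rather than issue it synchronously):

#include <cstddef>
#include <deque>
#include <optional>

class LocalTaskUnit {
    std::deque<Task> buffer;                   // small per-core hardware buffer
    static constexpr std::size_t kCapacity = 2;
    GlobalTaskUnit& gtu;
    int core;
public:
    LocalTaskUnit(GlobalTaskUnit& g, int c) : gtu(g), core(c) {}

    void enqueue(Task t) {
        if (buffer.size() == kCapacity) {      // buffer full: the oldest task
            gtu.enqueue(core, buffer.front()); // spills to the GTU, keeping LIFO
            buffer.pop_front();                // order for the local thread
        }
        buffer.push_back(t);
    }

    std::optional<Task> dequeue() {
        if (!buffer.empty()) {
            Task t = buffer.back();            // return the newest buffered task
            buffer.pop_back();
            if (auto next = gtu.dequeue(core)) // prefetch the next available task;
                buffer.push_front(*next);      // in hardware this overlaps with the
            return t;                          // task the thread just received
        }
        return gtu.dequeue(core);              // buffer empty: stall on the GTU
    }
};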

19 Experimental Evaluation  Benchmarks are from the RMS application domain  RMS = Recognition, Mining and Synthesis  Wide range of different areas  All benchmarks are parallelized

20 These benchmarks are straightforward to parallelize: each parallel loop simply specifies a range of indices and the granularity of tasks.
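A sketch of how such a loop-level benchmark might be expressed (the parallel_for helper and the task-queue interface here are illustrative, not the paper's API): the loop gives an index range and a grain size, and each chunk of grain iterations becomes one task for the scheduler.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

using LoopTask = std::function<void()>;

// Split [begin, end) into chunks of `grain` iterations; each chunk is one task.
void parallel_for(int begin, int end, int grain,
                  std::function<void(int)> body,
                  std::vector<LoopTask>& queue) {
    for (int lo = begin; lo < end; lo += grain) {
        int hi = std::min(lo + grain, end);
        queue.push_back([lo, hi, body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
}

int main() {
    std::vector<LoopTask> queue;
    std::vector<double> out(1000);
    parallel_for(0, 1000, 64, [&](int i) { out[i] = i * 0.5; }, queue);
    // 16 tasks of at most 64 iterations each; a scheduler would hand these
    // tasks to the cores, but here they simply run in order.
    for (auto& t : queue) t();
    std::printf("%zu tasks, out[999] = %.1f\n", queue.size(), out[999]);
}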

21 Task-level parallelism is more general than loop-level parallelism: each parallel section starts with a set of initial tasks, and any task may enqueue other tasks.
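A sketch of the difference (again mine, with a made-up tree traversal as the workload): the parallel section starts from an initial task, and a task enqueues new tasks as it discovers more work.

#include <cstdio>
#include <deque>

struct Node { int value; Node* left; Node* right; };

int main() {
    Node leaves[2] = {{2, nullptr, nullptr}, {3, nullptr, nullptr}};
    Node root{1, &leaves[0], &leaves[1]};

    std::deque<Node*> tasks{&root};        // the set of initial tasks
    long sum = 0;
    while (!tasks.empty()) {
        Node* n = tasks.back();            // dequeue one task
        tasks.pop_back();
        sum += n->value;                   // this task's work
        if (n->left)  tasks.push_back(n->left);   // the task enqueues more tasks
        if (n->right) tasks.push_back(n->right);  // as it discovers them
    }
    std::printf("sum = %ld\n", sum);       // prints sum = 6
}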

22 Benchmarks  In some of these benchmarks, the task size is small, so task queue overhead must be small to effectively exploit the available parallelism.  In some, parallelism is limited.  In some, task sizes are highly variable, so very efficient task management is needed for good load balancing.

23 Results  Results show the performance benefit of the proposed hardware for the loop-level and task-level benchmarks when running with 64 cores.  The hardware proposal is compared with  the best optimized software implementations  an idealized implementation (Ideal), in which tasks bypass the LTUs and are sent directly to/from the GTU with zero interconnect latency, and the GTU processes these tasks instantly.

24 Results

25 Results

26 Results  The graphs show the speedup over one-thread execution using the Ideal implementation.  Each benchmark has multiple bars; each bar corresponds to a different data set shown in the benchmark tables.

27 Results  For the loop-level benchmarks, the proposed hardware executes 88% faster on average than the optimized software implementation and only 3% slower than Ideal.  For the task-level benchmarks, on average the proposed hardware is 98% faster than the best software version and is within 2.7% of Ideal.

28 Conclusion  In order to benefit from the growing compute resources of MCAs, applications must expose their thread-level parallelism to hardware.  Previous work has proposed software implementations of dynamic task schedulers. But applications with small tasks, such as the RMS applications, achieve poor parallel speedups using software dynamic task scheduling, because the scheduler's overhead is large relative to their task sizes.

29 Conclusion  To enable good parallel scaling even for applications with very small tasks, a hardware scheme is proposed.  It consists of relatively simple hardware and is tolerant of growing on-die latencies; therefore, it is a good solution for scalable MCAs.  When the proposed hardware, the optimized software task schedulers, and an idealized hardware task scheduler are compared on the RMS benchmarks, the hardware gives large performance benefits over the software schedulers and comes very close to the idealized hardware scheduler.