Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications (University of Michigan, Electrical Engineering and Computer Science)

Presentation transcript:

University of Michigan, Electrical Engineering and Computer Science
Slide 1: Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications
Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

Slide 2: Multicore Architectures
Multicore becomes a trend
– Intel Core Duo, 2005
– Intel Core Quad, 2006
– Sun T1, 8 cores, 2005
– 16 to 32 cores, near future
Need for simpler cores
– Power density
– Cooling costs
Multiple cores on a chip
– High throughput
– Good for multithreaded apps
[Figure: a 4-core chip (core 0 to core 3) and the 8-core Sun T1 processor (core 0 to core 7), with cores connected to the L2 through an interconnect]

Slide 3: How About Single Thread Applications?
[Figure: single thread performance, Core Duo vs. Pentium M (same cache, same platform)]
Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006

Slide 4: Objective of this Work
Automatically accelerate single thread applications on multicore systems
– Exploit irregular parallelism across cores
    - Instruction level parallelism (ILP)
    - Fine-grain thread level parallelism (TLP)
    - Loop level parallelism (LLP)
– Adaptive architecture
    - Configure resources to exploit available parallelism
    - Dynamic adaptability
    - Hybrid parallelism

Slide 5: Approach
Voltron: Hardware/software approach
– Architecture mechanisms
    - Dual mode execution (coupled, decoupled)
    - Flexible inter-core communication
    - Fast thread spawning
    - Efficient memory ordering
    - High rate-of-return speculation
– Compiler techniques
    - Compiler controlled distributed branch
    - Fine-grain thread extraction
    - Speculative loop parallelization with recovery

Slide 6: Parallelism Type 1: ILP
[Figure: dataflow graph of a code region built from arithmetic, logical, shift, compare, load (L), store (S), and branch (br) operations]

Slide 7: Parallelism Type 1: ILP
Emulate VLIW
– Low latency communication
[Figure: the same dataflow graph with its operations partitioned across Core 0, Core 1, Core 2, and Core 3]

Slide 8: Parallelism Type 1: ILP
Emulate VLIW
– Low latency communication
– Compiler controlled distributed branch
– Lockstep execution
[Figure: the same dataflow graph partitioned across Core 0, Core 1, Core 2, and Core 3, with the branch (br) appearing on more than one core]

Slide 9: Voltron Architecture for ILP
[Figure: four cores (Core 0 to Core 3) linked by a stall bus and a br bus, above a banked L2 cache. Each core contains instruction fetch/decode, register files (GPR, FPR, PR, BTR), FUs, a Mem FU, a Comm FU with links to the north and west, an L1 instruction cache fed from the banked L2, and an L1 data cache connected to and from the banked L2.]
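The lockstep execution named on the ILP slides can be illustrated with a small cycle-count model. This is a minimal sketch, not the simulator used in the talk: it assumes that the stall bus makes every core wait whenever any one core stalls, and the function names and the per-step stall tables are hypothetical.

```python
def lockstep_cycles(stalls_per_core):
    """Cycles needed when all cores advance one step at a time together.

    stalls_per_core[c][t] is the number of extra stall cycles core c
    needs at step t (for example, a cache miss). Assumption: a stall on
    any core holds every core, as a shared stall bus would.
    """
    num_steps = len(stalls_per_core[0])
    if any(len(core) != num_steps for core in stalls_per_core):
        raise ValueError("lockstep cores must have equal-length schedules")
    total = 0
    for t in range(num_steps):
        worst = max(core[t] for core in stalls_per_core)  # slowest core wins
        total += 1 + worst
    return total


def independent_cycles(stalls_per_core):
    """Cycles needed if each core only pays for its own stalls."""
    return max(len(core) + sum(core) for core in stalls_per_core)


# Four cores, four steps; core 0 stalls 3 cycles at step 1,
# core 1 stalls 2 cycles at step 2.
stalls = [
    [0, 3, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(lockstep_cycles(stalls), independent_cycles(stalls))
```

In this toy model the lockstep schedule pays for every stall on every core, while independent execution pays only for the slowest core's own stalls. It ignores inter-core communication entirely, so it only shows the stall-propagation effect.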

Slide 10: Experimental Setup
Trimaran toolset
Simulator
– Multiple cores, multiple instruction streams
– Inter-core communication
– MOESI coherence protocol
Configuration
– 1 ALU, 1 memory unit, 1 communication unit per core
– 1 cycle inter-core move latency per hop
– 4KB L1 I-cache, 4KB L1 D-cache per core
– 128KB shared L2 cache
– Single core baseline
25 benchmarks from SpecInt, SpecFP, and MediaBench

Slide 11: ILP Speedup
[Figure: per-benchmark speedup chart, grouped by SpecInt, MediaBench, and SpecFP]
Achieved > 80% of the performance of a wide VLIW with the same resources.

Slide 12: Parallelism Type 2: Fine-grain TLP
[Figure: code region made of blocks A, B, C, D, and E]
Fine-grain threads
– Few instructions
– Scalar communication
– Shared stack frame

Slide 13: Parallelism Type 2: Fine-grain TLP
[Figure: blocks A, B, C, D, and E annotated with their ld and st operations]
Fine-grain threads
– Few instructions
– Scalar communication
– Shared stack frame

Slide 14: Parallelism Type 2: Fine-grain TLP
[Figure: blocks A, A', B, C, D, and E with their ld and st operations, split between Core 0 and Core 1]
Fine-grain threads
– Few instructions
– Scalar communication
– Shared stack frame
Decoupled execution
– Different control flow
– Asynchronous communication
Fast thread spawning
Efficient memory ordering
Compiler algorithm
– Memory dependences
– Load balance
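The decoupled execution described on this slide, with two fine-grain threads following different control flow and exchanging scalars asynchronously, can be mimicked in software. The sketch below is an illustration only: Python threads and `queue.Queue` stand in for cores and the hardware message queue, and the function `run_two_threads` and the work each thread does are hypothetical.

```python
import queue
import threading


def run_two_threads(data):
    """Run two fine-grain 'threads' that share one scalar through a queue."""
    chan_0_to_1 = queue.Queue()  # stands in for a core-to-core message queue
    result = {}

    def core0():
        partial = sum(x * x for x in data)  # work done on core 0
        chan_0_to_1.put(partial)            # send the scalar, do not wait
        result["core0"] = max(data)         # core 0 continues on its own path

    def core1():
        offset = len(data)                  # independent work on core 1
        partial = chan_0_to_1.get()         # receive: blocks until the value arrives
        result["core1"] = partial + offset

    threads = [threading.Thread(target=core0), threading.Thread(target=core1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(run_two_threads([1, 2, 3]))
```

The receiver only waits at the point where it actually needs the value, which is the property that lets the two threads drift apart in time.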

Slide 15: Voltron for Fine-grain TLP
[Figure: a single core containing instruction fetch/decode, register files (GPR, FPR, PR, BTR), FUs, a Mem FU, a Comm FU with links to the north and west, an L1 instruction cache fed from the banked L2, and an L1 data cache connected to and from the banked L2]

Slide 16: Dual Mode Network
Coupled mode
– Direct bypass [Multiflow]
– Coupled execution
– Latency: num_hops cycles (1 cycle minimum)
Decoupled mode
– Message queues [RAW]
– SEND / RECV
– Decoupled execution
– Latency: 2 + num_hops cycles (3 cycle minimum)
– Fast fine-grain thread spawning
– Enforce operation ordering
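The two latency figures on this slide can be written as one small function. This is a sketch of the arithmetic only, reading the slide as "num_hops cycles" for coupled mode and "2 + num_hops cycles" for decoupled mode; the function name and the mode strings are hypothetical.

```python
def operand_latency(mode, num_hops):
    """Inter-core operand latency in cycles for a given network mode."""
    if num_hops < 1:
        raise ValueError("a transfer crosses at least one hop")
    if mode == "coupled":
        return num_hops          # direct bypass: 1 cycle per hop
    if mode == "decoupled":
        return 2 + num_hops      # message queue: 2 cycles of overhead plus hops
    raise ValueError(f"unknown mode: {mode!r}")


for hops in (1, 2, 3):
    print(hops, operand_latency("coupled", hops), operand_latency("decoupled", hops))
```

With one hop this gives the 1 cycle and 3 cycle minimums quoted on the slide.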

Slide 17: Fine-grain TLP Speedup
[Figure: per-benchmark speedup chart, grouped by SpecInt, MediaBench, and SpecFP, with some benchmarks marked by an asterisk]
Works better for memory intensive applications

Slide 18: Parallelism Type 3: LLP
DOALL loops
– No cross-iteration dependences
– Iterations can execute in parallel
– Memory dependences hard to prove

Slide 19: Parallelism Type 3: LLP
DOALL loops
– No cross-iteration dependences
– Iterations can execute in parallel
– Memory dependences hard to prove
Statistical DOALL
– Profile memory dependences
– Speculatively parallelize
– Detect violation and rollback
[Figure: timeline in which core 0 runs init, iter 0-3, and finalize while core 1 runs init, iter 4-7, and finalize; an unexpected dependence causes a reset and a restart of iter 0-7]
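The statistical DOALL steps above (speculatively parallelize, detect a violation, roll back) can be sketched as a software model. This is a minimal illustration under stated assumptions, not the hardware mechanism from the talk: memory is a plain dict, chunks are executed one after another against the same checkpoint to imitate parallel cores, and the names `run_chunk` and `speculative_doall` are hypothetical.

```python
def run_chunk(body, snapshot, iterations):
    """Run iterations against a private write buffer, tracking reads and writes."""
    reads, writes = set(), {}

    def load(addr):
        if addr in writes:          # value produced earlier in this same chunk
            return writes[addr]
        reads.add(addr)             # value taken from the checkpointed memory
        return snapshot[addr]

    def store(addr, value):
        writes[addr] = value        # buffered until commit

    for i in iterations:
        body(i, load, store)
    return reads, writes


def speculative_doall(body, memory, n_iters, n_cores):
    """Return True if the speculative run committed, False if it rolled back."""
    chunk = (n_iters + n_cores - 1) // n_cores
    chunks = [range(s, min(s + chunk, n_iters)) for s in range(0, n_iters, chunk)]
    snapshot = dict(memory)         # checkpoint taken before speculation
    results = [run_chunk(body, snapshot, c) for c in chunks]

    written_so_far = set()
    for reads, writes in results:
        if reads & written_so_far:  # later chunk read what an earlier chunk wrote
            # Roll back: drop all buffers and rerun every iteration in order.
            _, seq_writes = run_chunk(body, snapshot, range(n_iters))
            memory.update(seq_writes)
            return False
        written_so_far |= writes.keys()

    for _, writes in results:       # commit buffers in iteration order
        memory.update(writes)
    return True


def independent(i, load, store):    # a[i] = 2 * b[i]
    store(i, 2 * load(100 + i))


def dependent(i, load, store):      # a[i] = a[i - 1] + 1
    if i > 0:
        store(i, load(i - 1) + 1)


mem = {**{i: 0 for i in range(8)}, **{100 + i: i for i in range(8)}}
print(speculative_doall(independent, mem, 8, 2))
print(speculative_doall(dependent, {i: 0 for i in range(8)}, 8, 2))
```

With 8 iterations on 2 cores the chunks are iterations 0-3 and 4-7, matching the figure. The second loop carries a dependence from iteration 3 into iteration 4, so the model detects it and reruns iterations 0-7 sequentially.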

Slide 20: Voltron for LLP
[Figure: a single core as on the earlier architecture slides, with the L1 D-cache extended with transactional memory support (a T bit alongside the tag, state, and data of each cache line)]
Detect memory dependence violation
Roll back memory state
Compiler rolls back register state

Slide 21: LLP Speedup
[Figure: per-benchmark speedup chart, grouped by SpecInt, MediaBench, and SpecFP]
Accelerate non-provable DOALL and small loops

Slide 22: Speedup for Hybrid Execution
[Figure: per-benchmark speedup chart, grouped by SpecInt, MediaBench, and SpecFP]
2 core average: ILP 1.23, TLP 1.16, LLP 1.17, Hybrid [value missing]
[core count missing] core average: ILP 1.33, TLP 1.23, LLP 1.37, Hybrid 1.83

Slide 23: Time Breakdown
[Figure: per-benchmark execution time breakdown, grouped by SpecInt, MediaBench, and SpecFP]
Both coupled and decoupled modes are necessary.

Slide 24: Conclusions and Future Work
Voltron: adaptive multicore system
– Accelerate single thread applications
– Exploit ILP, fine-grain TLP and statistical LLP
    - Coupled and decoupled execution
    - Dual-mode operand network
    - Compiler managed loop speculation
– Hybrid parallelism combines the benefits
Future work
– Fine-grain thread identification
– Virtualization of resources

Slide 25: Thank You
Questions?
For more information: