University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.

1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications
Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

2 Multicore Architectures
Multicore becomes a trend
–Intel Core Duo, 2005
–Intel Core Quad, 2006
–Sun T1, 8 cores, 2005
–16 to 32 cores, near future
Need for simpler cores
–Power density
–Cooling costs
Multiple cores on a chip
–High throughput
–Good for multithreaded apps
[Figure: multicore chip diagrams (core 0 to core 3; 8-core Sun T1 processor with core 0 to core 7, an interconnect, and L2)]

3 How About Single Thread Applications?
[Figure: single-thread performance, Core Duo vs. Pentium M (same cache, same platform)]
Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006

4 Objective of this Work
Automatically accelerate single thread applications on multicore systems
–Exploit irregular parallelism across cores
  Instruction level parallelism (ILP)
  Fine-grain thread level parallelism (TLP)
  Loop level parallelism (LLP)
–Adaptive architecture
  Configure resources to exploit available parallelism
  Dynamic adaptability
  Hybrid parallelism

5 Approach
Voltron: Hardware/software approach
–Architecture mechanisms
  Dual mode execution (coupled, decoupled)
  Flexible inter-core communication
  Fast thread spawning
  Efficient memory ordering
  High rate-of-return speculation
–Compiler techniques
  Compiler controlled distributed branch
  Fine-grain thread extraction
  Speculative loop parallelization with recovery

6 Parallelism Type 1: ILP
[Figure: dataflow graph of loads (L), stores (S), arithmetic and logic operations, and a branch (br)]

7 Parallelism Type 1: ILP
Emulate VLIW
–Low latency communication
[Figure: dataflow graph of loads (L), stores (S), arithmetic and logic operations, and a branch (br), partitioned across Core 0, Core 1, Core 2, and Core 3]

8 Parallelism Type 1: ILP
Emulate VLIW
–Low latency communication
–Compiler controlled distributed branch
–Lockstep execution
[Figure: dataflow graph of loads (L), stores (S), arithmetic and logic operations, partitioned across Core 0, Core 1, Core 2, and Core 3, with branch (br) operations marked]

9 Voltron Architecture for ILP
[Figure: four cores (Core 0 to Core 3) sharing a banked L2 cache and linked by a stall bus and a br bus. Each core contains instruction fetch/decode, register files (GPR, FPR, PR, BTR), function units (FU), a memory FU, a communication FU with links to the north and west, an L1 instruction cache filled from the banked L2, and an L1 data cache connected to/from the banked L2]

10 Experimental Setup
Trimaran Toolset
Simulator
–Multiple cores, multiple instruction streams
–Inter-core communication
–MOESI coherence protocol
Configuration
–1 ALU, 1 memory unit, 1 communication unit per core
–1 cycle inter-core move latency per hop
–4KB L1 I-cache, 4KB L1 D-cache per core
–128KB shared L2 cache
–Single core baseline
25 benchmarks from SpecInt, SpecFP, and MediaBench

11 ILP Speedup
[Figure: ILP speedup per benchmark, grouped into SpecInt, MediaBench, and SpecFP]
Achieved > 80% of the performance of a wide VLIW with the same resources.

12 Parallelism Type 2: Fine-grain TLP
Fine-grain threads
–Few instructions
–Scalar communication
–Shared stack frame
[Figure: code blocks A, B, C, D, and E]

13 Parallelism Type 2: Fine-grain TLP
Fine-grain threads
–Few instructions
–Scalar communication
–Shared stack frame
[Figure: code blocks A, B, C, D, and E, with load (ld) and store (st) operations marked]

14 Parallelism Type 2: Fine-grain TLP
Fine-grain threads
–Few instructions
–Scalar communication
–Shared stack frame
Decoupled execution
–Different control flow
–Asynchronous communication
Fast thread spawning
Efficient memory ordering
Compiler algorithm
–Memory dependences
–Load balance
[Figure: code blocks A, A’, B, C, D, and E, with their ld and st operations, divided between Core 0 and Core 1]
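The decoupled execution described on this slide can be illustrated with a small software analogy. The sketch below is not Voltron code: it assumes two Python threads stand in for Core 0 and Core 1, and a `queue.Queue` stands in for the asynchronous scalar channel between them. The function name `run_decoupled` and the end-of-stream marker are hypothetical.

```python
import queue
import threading


def run_decoupled(values):
    """Run a producer and a consumer thread that exchange scalars through a queue."""
    channel = queue.Queue()  # stands in for the inter-core channel (assumption)
    running_totals = []

    def core0():
        # Producer: computes one scalar per input and sends it without waiting.
        for v in values:
            channel.put(v * 2)
        channel.put(None)  # hypothetical end-of-stream marker

    def core1():
        # Consumer: follows its own control flow and blocks only when it needs data.
        total = 0
        while True:
            item = channel.get()
            if item is None:
                break
            total += item
            running_totals.append(total)

    threads = [threading.Thread(target=core0), threading.Thread(target=core1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return running_totals


print(run_decoupled([1, 2, 3]))  # [2, 6, 12]
```

The two threads never synchronize except through the queue, which is the property the slide calls asynchronous communication; the producer can run ahead of the consumer by as many values as the queue holds.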

15 Voltron for Fine-grain TLP
[Figure: core diagram with instruction fetch/decode, register files (GPR, FPR, PR, BTR), function units (FU), a memory FU, a communication FU with links to the north and west, an L1 instruction cache filled from the banked L2, and an L1 data cache connected to/from the banked L2]

16 Dual Mode Network
Coupled mode
–Direct bypass [Multiflow]
–Coupled execution
–Latency: num_hops cycles (1 cycle minimum)
Decoupled mode
–Message queues [RAW]
–SEND / RECV
–Decoupled execution
–Latency: 2 + num_hops cycles (3 cycles minimum)
–Fast fine-grain thread spawning
–Enforce operation ordering
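The two latency rules on this slide can be written out as a short calculation. This is a minimal sketch: the latency formulas come from the slide, while the helper `mesh_hops` and its 2-wide mesh layout of core ids are assumptions made only to produce example hop counts.

```python
def coupled_latency(num_hops: int) -> int:
    """Coupled mode (direct bypass): one cycle per hop, so at least 1 cycle."""
    if num_hops < 1:
        raise ValueError("communicating cores are at least one hop apart")
    return num_hops


def decoupled_latency(num_hops: int) -> int:
    """Decoupled mode (message queues): 2 + num_hops cycles, so at least 3."""
    if num_hops < 1:
        raise ValueError("communicating cores are at least one hop apart")
    return 2 + num_hops


def mesh_hops(src: int, dst: int, width: int = 2) -> int:
    """Manhattan distance between two core ids on an assumed width-wide mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


hops = mesh_hops(0, 3)           # opposite corners of the assumed 2x2 mesh
print(coupled_latency(hops))     # 2
print(decoupled_latency(hops))   # 4
```

Under these rules the decoupled network always costs two cycles more than the coupled one for the same distance, which is the price paid for queue-based SEND / RECV.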

17 Fine-grain TLP Speedup
[Figure: fine-grain TLP speedup per benchmark, grouped into SpecInt, MediaBench, and SpecFP; nine entries are marked with *]
Works better for memory intensive applications

18 Parallelism Type 3: LLP
DOALL loops
–No cross-iteration dependences
–Iterations can execute in parallel
–Absence of memory dependences is hard to prove

19 Parallelism Type 3: LLP
DOALL loops
–No cross-iteration dependences
–Iterations can execute in parallel
–Absence of memory dependences is hard to prove
Statistical DOALL
–Profile memory dependences
–Speculatively parallelize
–Detect violation and rollback
[Figure: timeline for core 0 (init, iter 0-3, finalize, reset) and core 1 (init, iter 4-7, finalize, reset); an unexpected dependence triggers a restart of iter 0-7]
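The speculate, detect, and roll back sequence for statistical DOALL can be modeled in software. The sketch below is an illustration under stated assumptions, not the Voltron mechanism: chunks run one after another rather than on separate cores, private write buffers stand in for hardware support, and the names `run_chunk` and `statistical_doall` are hypothetical.

```python
def run_chunk(body, iterations, memory):
    """Run a chunk of iterations speculatively against a private write buffer."""
    buffer, reads = {}, set()

    def load(addr):
        if addr in buffer:       # value produced earlier in this same chunk
            return buffer[addr]
        reads.add(addr)          # value observed from pre-loop memory
        return memory[addr]

    def store(addr, value):
        buffer[addr] = value     # memory itself stays untouched until commit

    for i in iterations:
        body(i, load, store)
    return buffer, reads


def statistical_doall(body, n_iters, memory, n_cores=2):
    """Return True if speculation committed, False if it rolled back."""
    chunk = (n_iters + n_cores - 1) // n_cores
    chunks = [range(s, min(s + chunk, n_iters)) for s in range(0, n_iters, chunk)]
    results = [run_chunk(body, c, memory) for c in chunks]

    written = set()
    for buffer, reads in results:
        if reads & written:
            # Violation: a later chunk read an address an earlier chunk wrote.
            # Roll back by discarding every buffer, then rerun serially.
            for i in range(n_iters):
                body(i, memory.__getitem__, memory.__setitem__)
            return False
        written.update(buffer)

    for buffer, _ in results:    # no violation: commit buffers in chunk order
        for addr, value in buffer.items():
            memory[addr] = value
    return True


def scale(i, load, store):       # independent iterations
    store(i, load(i) * 2)


def chain(i, load, store):       # iteration i depends on iteration i - 1
    if i > 0:
        store(i, load(i - 1) + 1)


a = [1, 2, 3, 4, 5, 6, 7, 8]
print(statistical_doall(scale, 8, a), a)
b = [0] * 8
print(statistical_doall(chain, 8, b), b)
```

With 8 iterations and 2 chunks the split is iter 0-3 and iter 4-7, matching the slide. The `scale` loop commits, while the `chain` loop is caught because the second chunk reads element 3, which the first chunk wrote, so all 8 iterations are rerun in order.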

20 Voltron for LLP
[Figure: core diagram with instruction fetch/decode, register files (GPR, FPR, PR, BTR), function units (FU), a memory FU, a communication FU with links to the north and west, an L1 instruction cache filled from the banked L2, and an L1 D-cache with transactional memory support connected to/from the banked L2; each cache entry holds T, tag, state, and data fields]
Detect memory dependence violation
Roll back memory state
Compiler rolls back register state

21 LLP Speedup
[Figure: LLP speedup per benchmark, grouped into SpecInt, MediaBench, and SpecFP]
Accelerates non-provable DOALL loops and small loops

22 Speedup for Hybrid Execution
[Figure: hybrid execution speedup per benchmark, grouped into SpecInt, MediaBench, and SpecFP]
2 core average: ILP 1.23, TLP 1.16, LLP 1.17, Hybrid 1.46
4 core average: ILP 1.33, TLP 1.23, LLP 1.37, Hybrid 1.83
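The averages on this slide can be compared with a little arithmetic. The sketch below uses only the eight numbers reported above; the two derived metrics (gain of hybrid over the best single parallelism type, and speedup per core) are added here for illustration and do not appear in the presentation.

```python
# Average speedups over the single core baseline, as reported on the slide.
speedups = {
    2: {"ILP": 1.23, "TLP": 1.16, "LLP": 1.17, "Hybrid": 1.46},
    4: {"ILP": 1.33, "TLP": 1.23, "LLP": 1.37, "Hybrid": 1.83},
}


def hybrid_gain(cores: int) -> float:
    """Ratio of the hybrid speedup to the best single-type speedup."""
    row = speedups[cores]
    best_single = max(v for k, v in row.items() if k != "Hybrid")
    return row["Hybrid"] / best_single


def speedup_per_core(cores: int, kind: str = "Hybrid") -> float:
    """Speedup divided by core count (a simple parallel efficiency measure)."""
    return speedups[cores][kind] / cores


for cores in (2, 4):
    print(cores, round(hybrid_gain(cores), 2), round(speedup_per_core(cores), 4))
```

On 2 cores the best single type is ILP (1.23), and on 4 cores it is LLP (1.37), so the hybrid gain over the best single type is roughly 1.19x and 1.34x respectively.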

23 Time Breakdown
[Figure: execution time breakdown per benchmark, grouped into SpecInt, MediaBench, and SpecFP]
Both coupled and decoupled modes are necessary.

24 Conclusions and Future Work
Voltron: adaptive multicore system
–Accelerate single thread applications
–Exploit ILP, fine-grain TLP and statistical LLP
  Coupled and decoupled execution
  Dual-mode operand network
  Compiler managed loop speculation
–Hybrid parallelism combines the benefits
Future work
–Fine-grain thread identification
–Virtualization of resources

25 Thank You
Questions?
For more information: http://cccp.eecs.umich.edu
