University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Slide 1: A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, University of Michigan
*HP Laboratories

Slide 2: Motivation
VLIW scaling problem
► Centralized resource
► Highly ported structures
► Wire delays
[Figure labels: FU, Register File, Instruction Fetch/Decode]

Slide 3: Multicluster VLIW
Distribute register files
Cluster function units
Distribute data caches
Clusters communicate through an interconnection network
Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
[Figure labels: Cluster 0, Cluster 1, FU, Register File, Interconnection network, Instruction Fetch/Decode]

Slide 4: Control Path Scaling Problem
Larger I-cache
Latency
► Long wires for control signal distribution
Code compression
► Hardware cost, power
► Grow quadratically with the number of FUs
[Figure labels: PC, I-cache, align/shift network, IR, NOP]

Slide 5: Straightforward Approach
Distribute I-fetch, in a spirit similar to the distribution of the data path
► Local communication of controls
► Reduce latency, hardware cost, power
Used in Multiflow Trace 14/300 processors
[Figure labels: per-cluster I-cache, IR, FU, Register File; PC; Interconnection network]

Slide 6: DVLIW Approach
Simple distribution has problems
► Doesn’t support code compression
► PC still a centralized resource
[Figure labels: per-cluster I-cache, align/shift, IR, FU, Register File; PC0, PC1; Interconnection network]

Slide 7: DVLIW Execution Model
Clusters execute in lock-step
► When one cluster stalls, all clusters stall
Clusters collectively execute one thread
► Each cluster runs an instruction stream
► Compiler orchestrates the execution of streams
► Compiler manages communication
► Lightweight synchronization

Slide 8: DVLIW Benefits
Completely decentralized architecture
► Distributed data path
► Distributed control path
Supports arbitrary code compression
Exploiting ILP on multi-core style system
► Good for embedded applications
► Low cost
► Compiler support

Slide 9: DVLIW Architecture
[Figure: VLIW clusters 0 to 3 connected to a banked L2. Labels: L1 I-Cache, L1 D-Cache, PC, Next PC, br_target, align/shift, IR, Register Files, FU, links to neighboring clusters and to the banked L2]

Slide 10: Code Organization
Code for each cluster is consecutive in memory
Operations in the same MultiOp stored in different memory locations
Each cluster computes its own next PC
Conventional VLIW (PC): A1 A2 A3 A4 A5 B1 B2 B3 B4 …
DVLIW, cluster 0 (PC0): A1 A2 A3 B1 B2 …
DVLIW, cluster 1 (PC1): A4 A5 B3 B4 …
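The layout change on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the nested-list representation of a MultiOp are hypothetical, and the two-cluster split simply mirrors the A1..A5 / B1..B4 example above.

```python
# Hypothetical sketch: names and data representation are illustrative,
# not taken from the DVLIW toolchain.

def conventional_layout(multiops):
    """Store all operations of a MultiOp contiguously, one MultiOp after another."""
    return [op for multiop in multiops for cluster_ops in multiop for op in cluster_ops]


def dvliw_layout(multiops, num_clusters):
    """Give each cluster its own consecutive code region."""
    regions = [[] for _ in range(num_clusters)]
    for multiop in multiops:
        for cluster, cluster_ops in enumerate(multiop):
            regions[cluster].extend(cluster_ops)
    return regions


# MultiOp A: A1-A3 assigned to cluster 0, A4-A5 to cluster 1.
# MultiOp B: B1-B2 assigned to cluster 0, B3-B4 to cluster 1.
multiops = [
    [["A1", "A2", "A3"], ["A4", "A5"]],
    [["B1", "B2"], ["B3", "B4"]],
]

flat = conventional_layout(multiops)               # walked by a single PC
regions = dvliw_layout(multiops, num_clusters=2)   # regions[i] is walked by PCi
```

In the conventional layout one PC steps through `flat`. In the DVLIW layout, operations of the same MultiOp end up in different regions, which is why each cluster needs its own PC.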

Slide 11: Branch Mechanism
Maintain correct execution order
► All clusters transfer control at the same cycle
► All clusters branch to the same logical multiop
Unbundled branch in HPL-PD
PBR btr1, TARGET: each cluster specifies its own target
CMPP pr0, (x>100)?: broadcast to all clusters
BR btr1, pr0: replicated in each cluster

Slide 12: Branch Handling Example
Conventional VLIW:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  …
  br btr1, pr0
DVLIW, cluster 0:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0
DVLIW, cluster 1:
  …
  pbr btr1, BB2’
  …
  br btr1, pr0
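A minimal Python model of this example may help show why every cluster lands on the same logical block. It is a sketch under assumptions: the `Cluster` class, `run_branch`, and all addresses are hypothetical, and the four steps are executed in sequence rather than cycle-accurately.

```python
# Hypothetical model of the pbr / cmpp / bcast / br sequence; not the real ISA semantics.
from dataclasses import dataclass


@dataclass
class Cluster:
    pc: int             # this cluster's own program counter
    fallthrough: int    # local address of the next block when the branch is not taken
    btr1: int = 0       # branch target register, written by pbr
    pr0: bool = False   # predicate register


def run_branch(clusters, x, local_targets):
    # pbr: each cluster loads its own local address of the same logical block (BB2, BB2').
    for cluster, target in zip(clusters, local_targets):
        cluster.btr1 = target
    # cmpp: only cluster 0 evaluates the condition.
    clusters[0].pr0 = x > 100
    # bcast: cluster 0 sends pr0 to every other cluster.
    for cluster in clusters[1:]:
        cluster.pr0 = clusters[0].pr0
    # br: the replicated branch executes in every cluster at the same step.
    for cluster in clusters:
        cluster.pc = cluster.btr1 if cluster.pr0 else cluster.fallthrough
    return [cluster.pc for cluster in clusters]


clusters = [Cluster(pc=0x10, fallthrough=0x14), Cluster(pc=0x40, fallthrough=0x44)]
taken = run_branch(clusters, x=150, local_targets=[0x20, 0x50])
```

With `x=150` both clusters jump to their own copy of the target block. With `x <= 100` both fall through, so the per-cluster PCs stay aligned on the same logical MultiOp.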

Slide 13: Sleep Mode
Idle blocks after distribution
Put cluster into sleep mode
► Compiler managed
► Save energy
► Reduce code size
Mode change happens at block boundary
[Figure labels: Cluster 0, Cluster 1, BR, SLEEP, WAKE]

Slide 14: Experimental Setup
Trimaran toolset
Processor configuration
► 4 clusters, 2 INT, 1 FP, 1 MEM, 1 BR per cluster
► 16K L1 I-cache total
► Perfect data cache assumed
Power model
► Verilog for instruction align/shift logic
► Wire model
► Cacti cache model
21 benchmarks from MediaBench and SPECINT2000

Slide 15: Change in Global Communication Bits
[Figure: results grouped by MediaBench and SPECINT]

Slide 16: Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Figure annotations: 40% saving, 67% saving, 80% saving, 21% saving]

Slide 17: Normalized Code Size
Baseline: Conventional VLIW with compressed encoding
Traditional method (single PC): 7x increase
DVLIW: 40% increase

Slide 18: Result Summary
DVLIW benefits
► Order of magnitude reduction in global communication
► 40% savings in control path energy
► 5x code size reduction vs. simple distribution
Small overhead for ILP execution on CMP
► 3% increase in execution cycles
► 4% increase in I-cache stalls
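The 5x code size figure appears to follow from the two numbers on the Normalized Code Size slide. The short check below assumes that "7x increase" means the single-PC distributed code is 7 times the baseline size; that reading is an assumption, and the variable names are illustrative.

```python
# Sizes expressed as a percentage of the baseline
# (conventional VLIW with compressed encoding).
BASELINE = 100
simple_distribution = 7 * BASELINE   # assumption: "7x increase" read as 7x the baseline size
dvliw = BASELINE + 40                # 40% increase over the baseline

reduction = simple_distribution / dvliw
print(f"DVLIW code is {reduction:.1f}x smaller than simple distribution")
```

Under that reading the ratio is 700 / 140, which matches the 5x reduction quoted in the summary.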

Slide 19: Conclusions
DVLIW removes last centralized resource in a multicluster VLIW
► Fully distributed control path
► Scalable architecture
More energy efficient
Stylized CMP architecture
► Exploit ILP
► Multiple instruction streams
► Compiler orchestrated

Slide 20: Thank You
For more information ►