University of Michigan Electrical Engineering and Computer Science 1 Processor Acceleration Through Automated Instruction Set Customization Nathan Clark,

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science 1 Processor Acceleration Through Automated Instruction Set Customization Nathan Clark, Hongtao Zhong, Scott Mahlke Advanced Computer Architecture Lab University of Michigan, Ann Arbor December 3, 2003

2 Motivation
Cell phones, PDAs, digital cameras, etc. are everywhere
– High performance yet low power design point
General core + ASIC solution
– Limited post-programmability
General core + application specific instructions (CFUs)
[Figure: CPU + ASIC versus CPU + CFU]

3 What is a CFU?
Combine multiple primitive operations
– Smaller code size, fewer RF reads
– Increases performance
[Figure: DFG of primitive operations (&, |, <<, ^, *, +) with subgraphs grouped into CFU 1 and CFU 2]

4 Automation is Key
This is ¼ of the DFG for a single basic block of blowfish
[Figure: DFG fragment; visible node labels include 159 XOR, 164 SHR, 173 AND]

5 Related Work
Tensilica Xtensa
– Commercial example
– MIPS core + manually constructed CFU
Automatic instruction set synthesis is a mature field
– See paper for comparison of techniques
Our contributions
– Novel technique for automatic CFU creation
– System to utilize CFUs in multiple applications
– Analysis of how effectively CFUs for one application apply to other applications in the same domain

6 System Overview
Synthesis
– Subgraph identification
  Discover candidates for CFUs
  Weed out what shouldn’t be picked
– Selection
  Determine which candidates to use as CFUs
Compilation
– Subgraph replacement
  Make use of the CFUs in a range of applications

7–9 Subgraph Identification
Grow subgraphs from seed nodes
– All nodes are seeds
– Most directions don’t make sense
How to decide where to grow?
– Making decisions using factors similar to an architect
– Take 4 factors into consideration
  Criticality, Latency, Area, Input/Output
Sum of these factors determines value of each direction
– NOT picking CFUs
[Figure: example DFG (%, ^, <<, +, *, &, |) with a growing list of CFU candidates]
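The growth step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `threshold` cutoff, the `max_nodes` cap standing in for the external constraints, the graph, and every factor value are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Set

Graph = Dict[str, Set[str]]  # node -> neighbouring nodes in the DFG


def direction_value(criticality: float, latency: float, area: float, io: float) -> float:
    """The value of a growth direction is the sum of the four factors."""
    return criticality + latency + area + io


def grow_candidates(
    graph: Graph,
    seed: str,
    score: Callable[[FrozenSet[str], str], float],
    threshold: float,  # assumed cutoff; not given on the slide
    max_nodes: int,    # assumed stand-in for the external constraints
) -> Set[FrozenSet[str]]:
    """Collect every subgraph reachable from `seed` via directions scoring >= threshold."""
    seen: Set[FrozenSet[str]] = set()
    worklist = [frozenset([seed])]
    while worklist:
        sub = worklist.pop()
        if sub in seen:
            continue
        seen.add(sub)
        if len(sub) >= max_nodes:
            continue  # constraint met: stop growing this candidate
        frontier = {n for node in sub for n in graph[node]} - sub
        for nxt in frontier:
            if score(sub, nxt) >= threshold:
                worklist.append(sub | {nxt})
    return seen


# Hypothetical DFG: chain a-b-c plus a side edge a-d.
graph: Graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b"}, "d": {"a"}}
# Hypothetical (criticality, latency, area, io) factors for growing toward each node.
factors = {"b": (10, 5, 5, 5), "c": (10, 4, 4, 4), "d": (2, 2, 2, 2)}
candidates = grow_candidates(
    graph, "a", lambda sub, nxt: direction_value(*factors[nxt]), threshold=20, max_nodes=3
)
print(sorted("".join(sorted(c)) for c in candidates))
```

Since all nodes are seeds, a full pass would call `grow_candidates` once per node and merge the resulting sets. Low-scoring directions (node `d` here) are never explored, which is what keeps the candidate set small.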

10 Critical Path
Combining operations on the critical path will shrink the longer dependence chains
– Maximize potential performance gain
Wt = 10 / (slack + 1)
– Slack is # cycles off longest dependence path
Examples: 10/(0+1) = 10; 10/(2+1) = 3.33
[Figure: example DFG (^, &, >>, +, <<)]
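The criticality weight can be written as a one-line function. The sketch below assumes the form 10 / (slack + 1), which is what the two worked values on the slide imply; the function name and the `max_weight` parameter are illustrative.

```python
def criticality_weight(slack_cycles: int, max_weight: float = 10.0) -> float:
    """Weight for growing toward a node `slack_cycles` off the longest dependence path."""
    if slack_cycles < 0:
        raise ValueError("slack must be non-negative")
    return max_weight / (slack_cycles + 1)


on_critical_path = criticality_weight(0)             # node on the critical path
two_cycles_off = round(criticality_weight(2), 2)     # node with two cycles of slack
print(on_critical_path, two_cycles_off)
```

The `+ 1` in the denominator keeps the weight finite for nodes with zero slack, which receive the maximum weight.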

11 Latency
Growing toward low latency operations allows combination of more nodes in a cycle
– Maximize DFG compression
Wt examples: 10*0.3 / 0.6 = 5; 10*0.3 / 0.36 = 8.33
[Figure: example DFG (^, &, >>, +, <<)]
[Table: Opcode, Area, Cycles per macrocell]
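One way to read the worked values on this slide is as 10 × (subgraph latency before growth) / (subgraph latency after growth). That reading is an assumption, since only the numeric examples are available; the function and parameter names below are hypothetical.

```python
def latency_weight(current_latency: float, grown_latency: float, scale: float = 10.0) -> float:
    """Assumed form: favour directions that add little latency to the subgraph."""
    if current_latency <= 0 or grown_latency <= 0:
        raise ValueError("latencies must be positive")
    return scale * current_latency / grown_latency


# Latencies are fractions of a clock cycle, as in the slide's examples.
toward_slow_op = round(latency_weight(0.3, 0.6), 2)
toward_fast_op = round(latency_weight(0.3, 0.36), 2)
print(toward_slow_op, toward_fast_op)
```

Under this reading, growing toward an operation that doubles the subgraph latency halves the weight, while a cheap operation leaves it close to the maximum.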

12 Area
Want the most benefit for the least area
Area is the sum of macrocell areas
Wt examples: 10*0.5/0.5 = 10; 10*0.5/1.5 = 3.33
[Figure: example DFG (^, &, >>, +, <<)]
[Table: Opcode, Area, Cycles per macrocell]
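A small sketch of the area factor, assuming the weight is 10 × (area before growth) / (area after growth), with subgraph area taken as the sum of macrocell areas as the slide states. The macrocell areas below are hypothetical values chosen only so that the results match the slide's two worked numbers.

```python
from typing import Sequence

# Hypothetical macrocell areas (not the values from the slide's table).
MACROCELL_AREA = {"^": 0.5, ">>": 0.0, "+": 1.0}


def subgraph_area(ops: Sequence[str]) -> float:
    """Area of a candidate is the sum of its macrocell areas."""
    return sum(MACROCELL_AREA[op] for op in ops)


def area_weight(current_ops: Sequence[str], new_op: str, scale: float = 10.0) -> float:
    """Assumed form: scale * area before growth / area after growth."""
    current = subgraph_area(current_ops)
    grown = current + MACROCELL_AREA[new_op]
    if grown == 0:
        return scale  # growing costs nothing, so give the maximum weight
    return scale * current / grown


toward_shift = area_weight(["^"], ">>")
toward_add = round(area_weight(["^"], "+"), 2)
print(toward_shift, toward_add)
```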

13 Input/Output
Want CFUs to use as few RF ports as possible
– Smaller encoding
– Allow growth of larger candidates
Wt examples: 10*2/(2+1) = 6.67; 10*2/(4+1) = 4
[Figure: example DFG (^, &, >>, +, <<)]
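To make the register-file port count concrete, the sketch below counts the inputs and outputs of a candidate subgraph in a DFG. It illustrates the quantity being minimized rather than the slide's weight formula, and the edge list and node names are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]  # (producer, consumer)


def count_ports(edges: List[Edge], subgraph: Iterable[str]) -> Tuple[int, int]:
    """Return (inputs, outputs) for a subgraph.

    Inputs are distinct outside values consumed inside the subgraph;
    outputs are subgraph nodes whose result is consumed outside it.
    """
    sub = set(subgraph)
    inputs = {src for src, dst in edges if dst in sub and src not in sub}
    outputs = {src for src, dst in edges if src in sub and dst not in sub}
    return len(inputs), len(outputs)


# Hypothetical DFG: a, b, c are live-in values; the rest are operations.
edges = [
    ("a", "xor"), ("b", "xor"),
    ("xor", "and"), ("c", "and"),
    ("and", "add"), ("xor", "shl"),
]
single = count_ports(edges, {"xor"})
pair = count_ports(edges, {"xor", "and"})
print(single, pair)
```

Here growing from `xor` to `{xor, and}` raises the requirement from 2 inputs and 1 output to 3 inputs and 2 outputs, because the `xor` result is still needed outside the subgraph.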

14–24 Example
[Figure sequence: a candidate subgraph grows step by step through the example DFG (operations ^, &, >>, +, <<)]
Finished – Met External Constraints

25 Set of Candidates
[Figure: candidate subgraphs built from ^, <<, & and + operations]

26 Avoids Exponential Explosion
[Figure: chart; axis label: Speedup]

27 Greedy Selection Heuristic

Subgraph Number   Value   Cost   Ops
1                 20      4      (3,4),(6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 9       5      (1,7)

Subgraph Number   Value   Cost   Ops
1                 10      4      (6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 0       5

Use estimates of performance improvement / cost
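A greedy selection by value/cost ratio can be sketched as below. The candidate data mirrors the slide's table, and the second table on the slide appears to show the state after subgraph 2 is chosen (occurrences sharing ops 1, 3, or 7 disappear). The even split of subgraph 1's value across its two occurrences, the area `budget`, and the stopping rule are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Candidate:
    cost: float
    occurrences: List[Tuple[Set[int], float]]  # (ops covered, estimated value)

    @property
    def value(self) -> float:
        return sum(v for _, v in self.occurrences)


def greedy_select(candidates: Dict[str, Candidate], budget: float) -> List[str]:
    """Repeatedly pick the affordable candidate with the best value/cost ratio."""
    # Work on copies so the caller's table is left untouched.
    live = {n: Candidate(c.cost, list(c.occurrences)) for n, c in candidates.items()}
    chosen: List[str] = []
    while True:
        affordable = [n for n, c in live.items() if c.cost <= budget and c.value > 0]
        if not affordable:
            return chosen
        best = max(affordable, key=lambda n: live[n].value / live[n].cost)
        picked = live.pop(best)
        chosen.append(best)
        budget -= picked.cost
        covered = set().union(*(ops for ops, _ in picked.occurrences))
        # Ops now covered cannot be reused, so drop overlapping occurrences.
        for c in live.values():
            c.occurrences = [(ops, v) for ops, v in c.occurrences if not (ops & covered)]


table = {
    "1": Candidate(4, [({3, 4}, 10), ({6, 8}, 10)]),
    "2": Candidate(1, [({1, 3, 7}, 6)]),
    "N": Candidate(5, [({1, 7}, 9)]),
}
selection = greedy_select(table, budget=5)
print(selection)
```

Subgraph 2 is picked first because 6/1 beats 20/4 and 9/5. After the update, subgraph 1 is worth 10 and subgraph N is worth 0, matching the slide's second table.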

28 Multiple applications can utilize CFUs
Vflib pattern matcher [Cor ’99]
[Figure: flow between Instruction Synthesis, CFU Description, and Compiler Replacement; a numbered DFG (nodes 1 to 5) with nodes mapped to a CFU]

29 Experimental Setup
Implemented in the Trimaran toolset
Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle
– CFUs use Int issue slot
CFU latency/area generated as sum of each individual macrocell
– Pipeline latches were added if CFU latency >1 clock cycle
– 300 MHz clock assumed
– No branch or memory instructions in CFUs
Four application domains tested
– Audio, Encryption, Image, Network
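The latency rule on this slide can be illustrated with a short sketch: sum the macrocell latencies, round up to whole cycles, and add one pipeline latch per stage boundary. The per-opcode latencies (in fractions of a clock cycle) are hypothetical, and summing over all operations follows the slide's wording literally.

```python
import math
from typing import Sequence, Tuple

# Hypothetical macrocell latencies, expressed as fractions of one clock cycle.
MACROCELL_CYCLES = {"&": 0.25, "^": 0.25, "<<": 0.25, "+": 0.75}


def cfu_timing(ops: Sequence[str]) -> Tuple[float, int, int]:
    """Return (total latency in cycles, pipeline stages, latches added)."""
    total = sum(MACROCELL_CYCLES[op] for op in ops)
    stages = max(1, math.ceil(total))
    return total, stages, stages - 1  # latches only when latency exceeds one cycle


short_cfu = cfu_timing(["^", "&"])
long_cfu = cfu_timing(["^", "+", "&"])
print(short_cfu, long_cfu)
```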

30 Native Encryption Results

31 Encryption Cross Compile

32 Generalizing CFUs
Subsumed (Multiple Paths)
Wildcards (Multiple Nodes)
[Figure: example CFU over >>, |, + with inputs IN_1, IN_2 and constants 0x8, 0xF; a subsumed variant using constants 0x0; a wildcard variant with nodes |,& and +,-]
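Wildcarding can be pictured as letting each CFU node accept a set of opcodes instead of exactly one. The sketch below treats a CFU as a linear chain of operations, which is a simplification chosen for brevity; matching general subgraphs needs a graph pattern matcher such as the Vflib matcher mentioned earlier. The patterns are hypothetical and loosely modeled on the opcodes in the figure.

```python
from typing import Sequence, Set


def matches(pattern: Sequence[Set[str]], ops: Sequence[str]) -> bool:
    """True if each operation is allowed by the pattern node at the same position."""
    return len(pattern) == len(ops) and all(
        op in allowed for allowed, op in zip(pattern, ops)
    )


exact_cfu = [{">>"}, {"|"}, {"+"}]                # one opcode per node
wildcard_cfu = [{">>"}, {"|", "&"}, {"+", "-"}]   # wildcard nodes accept alternatives

chain = [">>", "&", "-"]
print(matches(exact_cfu, chain), matches(wildcard_cfu, chain))
```

The wildcard version covers chains the exact CFU misses, which is how a CFU generated for one application can apply to another.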

33 Effects of Generalization
[Figure: Speedup for CFUs and Subsumed Subgraphs across blowfish, bfish-rijn, bfish-sha, rijndael, rijn-bfish, rijn-sha, sha, sha-bfish, sha-rijn]

34 Conclusions
Developed two phase instruction set synthesis system
– Guide function removes bad candidates
– Greedy selection heuristic
Substantial speedups can be attained with very little die impact
Subsumed subgraphs and wildcarding increase cross-application effectiveness
[Table: Ave. Speedup by Domain (Encryption, Network, Image, Audio)]

35 Questions?

36 Backup slides

37 Individual Factors - Blowfish

38 Individual Factors - Djpeg

39 Selection
Uses estimates of performance improvement
Greedy Heuristic used
[Figure: example DFG (^, &, >>, +, <<)]