System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures. Tony Givargis (Frank Vahid, Jörg Henkel)

Presentation transcript:

System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures
Tony Givargis (Frank Vahid, Jörg Henkel)
Center for Embedded Computer Systems, University of California, Irvine, CA

2 Overview
Given:
– Parameterized SOC architecture
– Fixed application
Automatically explore the design space
Find optimal points with respect to power and performance
Application: void main(){ while(1){ Receive(); Decode(); Display(); } }
[Figure: SOC with CPU, Memory, JPEG CODEC, Math/FPU, UART, I$-D$, BRIDGE; cache parameters Size = {1K, 4K, 8K}, Line = {4, 8, 16}, Assoc = {1, 2, 4}]
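
The cache parameters on this slide already span a small design space. As a minimal sketch (the dictionary layout and the helper name `enumerate_configs` are assumptions for illustration, not part of the talk), every configuration is one value chosen per parameter:

```python
from itertools import product

# Cache parameter values as listed on the slide.
cache_params = {
    "size": ["1K", "4K", "8K"],
    "line": [4, 8, 16],
    "assoc": [1, 2, 4],
}

def enumerate_configs(params):
    """Return every assignment of one value to each parameter."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]

configs = enumerate_configs(cache_params)
print(len(configs))  # 3 * 3 * 3 = 27
```

Three parameters with three values each give 27 points; the count multiplies with every parameter added, which is why exhaustive search stops scaling.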

3 Motivation
Design trends:
– Growing demand for portable devices
– Growing demand for low power design
– Increased application complexity
– Shrinking time-to-market windows
Technology trends:
– Increased chip capacity
– Increased I/O pins
– Improved on-chip integration techniques (storage, digital, analog, …)
– SOC era
Need for greater designer productivity!

4 Motivation
One approach: reuse of existing IP
– IP selection?
– IP integration?
– SOC verification?
– Multi-source IP licensing
– More…
[Figure: SOC with CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, BRIDGE, each slot marked "?"; candidate IP: MIPS, ARM, RAM, SRAM, DRAM, JPEG CODEC1, JPEG CODEC2, Math/FPU, UART, USB, ISA BRIDGE, AMBA BRIDGE]

5 Motivation
Alternate approach: reuse of SOC
– Designed, integrated, tested
– Domain specific
– Parameterized
Designed by firms specializing in SOC
User: map application, then "configure-and-execute" (successors to microcontrollers!)
[Figure: parameterized SOC with CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, BRIDGE]

6 Motivation
Parameterized SOC:
– Composed of 100s of cores
– Cores are "configurable"
– Configurations impact power/performance
– Large number of total configurations!
– Architecture is otherwise fixed!
[Figure: parameterized SOC with CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, BRIDGE]

7 Motivation
– ATI Technologies: XILLEON™ 220 SOC for Digital Set-top Box Market
– Tensilica: Xtensa™ 1040 configurable processor cores
– Philips Semiconductors: Velocity RSP9™ SOC platforms
– Adelante Technologies: offers complete SOC customizable platforms for DSP domains
– More…

8 Outline
– Previous work
– Target architecture
– Power/performance estimation
– Parameter space exploration
– Experiments
– Conclusion

10 Previous Work
Parameterized SOC design
– [Malik00], [Veidenbaum99], [Vahid99], [Stan95]
Power/performance evaluation
– [Brandolese00], [Simunic99], [Li98], [Tiwari94]
Design space exploration (manual)
– [Givargis99], [Lieverse99]
Design space exploration (automatic)
– Focus of this work…

11 Previous Work
[Figure: Y-chart [Lieverse99] with nodes Architecture, Application, Mapping, Analysis, Numbers; label "Auto"]

13 Target Architecture
[Figure: block diagram with MIPS, I-Cache, D-Cache, Bridge, Peripheral Bus, UART, DCT CODEC, Memory, DMA]

14 Target Architecture
Configurable parameters:
– Voltage scale
– Size, line, associativity (caches)
– Bus width, encoding (gray, invert)
– UART tx/rx buffer size
– DCT resolution
[Figure: block diagram with MIPS, I-Cache, D-Cache, Bridge, Peripheral Bus, UART, DCT CODEC, Memory, DMA]

19 Target Architecture
26 parameters, […] configurations
What are the optimal configurations (given a fixed application)?
[Figure: block diagram with MIPS, I-Cache, D-Cache, Bridge, Peripheral Bus, UART, DCT CODEC, Memory, DMA]

20 Problem Summary
What are the possible power/performance tradeoffs? (100 trillion)
→ Need to efficiently evaluate power/performance (1/sec → 150,000 years)
→ Need to explore the configuration space
[Figure: parameterized SOC with CPU, Memory, JPEG CODEC, Math/FPU, UART, MMX, BRIDGE]

22 Power Evaluation
Exploration works with:
– Chip instrumentation (real-time)
– System-level simulation
– RTL simulation
– Gate-level simulation
– Circuit-level simulation
Relative accuracy required!
[Figure: digital camera application mapped on our SOC, capturing 1 image]

24 Power Evaluation – Processor
[Tiwari94/00]'s instruction-level power model:
– Measure watt/inst
– Account for stalls + dependency
– Apply traces
[Figure: target architecture block diagram]
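
The instruction-level idea can be sketched as a table lookup summed over a trace. This is a minimal sketch under stated assumptions: the cost table, the stall cost, and the name `trace_energy_nj` are hypothetical, the numbers are not measurements, and inter-instruction dependency effects are left out.

```python
# Hypothetical per-instruction base costs in nanojoules (not measured values).
BASE_COST_NJ = {"add": 1.0, "load": 2.5, "mul": 3.0}
STALL_COST_NJ = 0.5  # hypothetical cost of one stall cycle

def trace_energy_nj(trace):
    """Sum base costs plus stall overhead over (opcode, stall_cycles) pairs."""
    return sum(BASE_COST_NJ[op] + stalls * STALL_COST_NJ for op, stalls in trace)

# A three-instruction trace: the load stalls 2 cycles, the mul stalls 1.
trace = [("load", 2), ("add", 0), ("mul", 1)]
print(trace_energy_nj(trace))  # 3.5 + 1.0 + 3.5 = 8.0
```

Because the per-instruction costs are measured once, evaluating a new configuration only requires replaying the trace, which is what keeps the estimate fast.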

25 Power Evaluation – Cache/Mem.
[Evans95]
– Capacitance model of sub-components
– Switching obtained via simulation (parameter dependent)
[Figure: target architecture block diagram]

26 Power Evaluation – Buses
[Chern92]
– Model bus capacitance
– Switching derived from I/O traffic (parameter dependent)
[Figure: target architecture block diagram]
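
Switching activity on a bus can be counted directly from a traffic trace, and the "invert" encoding listed among the bus parameters changes that count. The sketch below uses the standard bus-invert scheme; the function names, the 8-bit width, the all-zero initial bus state, and the sample trace are assumptions for illustration.

```python
def transitions(words):
    """Count bit flips between consecutive words on the bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def bus_invert(words, width):
    """Bus-invert coding: send the complement when more than half the lines would flip.

    Returns (encoded words, values of the extra invert line).
    Assumes the bus lines start at 0.
    """
    mask = (1 << width) - 1
    prev, encoded, invert = 0, [], []
    for w in words:
        flip = bin(prev ^ w).count("1") > width // 2
        out = (~w & mask) if flip else w
        encoded.append(out)
        invert.append(int(flip))
        prev = out
    return encoded, invert

words = [0x00, 0xFF, 0x00, 0xFF]
encoded, invert = bus_invert(words, 8)
plain_flips = transitions(words)                          # 24
coded_flips = transitions(encoded) + transitions(invert)  # 0 data flips + 3 invert-line flips
print(plain_flips, coded_flips)
```

Multiplying the flip count by the modeled line capacitance and the square of the supply voltage gives a relative energy figure, which is all the exploration needs.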

27 Power Evaluation – Peripherals
Observation: cores execute instructions!
→ Apply a technique similar to that used for processors!
[Figure: target architecture block diagram]

28 Power Evaluation – Summary
[Figure: per-core percentages: UART (5%), MIPS (10%), I-Cache (8%), D-Cache (8%), Bridge (5%), DCT CODEC (5%), Memory (8%), DMA (5%); Peripheral Bus (no value given)]
~50-100K instructions/second! (Platune)

30 Exploration
Problem formulation:
– Parameters $P_1, P_2, \ldots, P_n$
– A configuration (point) is an assignment of values to all parameters
How to efficiently generate all Pareto-optimal configurations?

31 Exploration
Algorithm idea, with parameters A (10 values), B (32 values), C (32 values):
– A and B interdependent: 10 * 32 = 320 points
– A and C are independent: 10 + 32 = 42 points
– C and B are independent: 32 + 32 = 64 points
Directed graph over A, B, C: exhaustive 10 * 32 * 32 = 10,240 points versus 138 points
With knowledge about dependency we prune 98.6%
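
The counts on this slide follow from one rule: interdependent parameters multiply, independent parameters add. A short check, taking A with 10 values and B and C with 32 each; the figure of 138 points is copied from the slide rather than derived here, and the variable names are illustrative.

```python
a, b, c = 10, 32, 32  # number of values for parameters A, B, C

interdependent_ab = a * b   # every pair must be tried together
independent_ac = a + c      # each parameter can be tuned on its own
independent_cb = c + b
exhaustive_abc = a * b * c  # no dependency knowledge at all

# 138 is the number of points the slide reports after pruning.
pruned_fraction = 1 - 138 / exhaustive_abc

print(interdependent_ab, independent_ac, independent_cb, exhaustive_abc)
print(round(pruned_fraction, 4))  # about 0.9865, in line with the 98.6% on the slide
```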

32 Exploration
A → B: Pareto-optimal configurations of B calculated after Pareto-optimal configurations of nodes along the path A → B
A → B → A (cycle): Pareto-optimal configurations of all the parameters on the cycle calculated simultaneously
A: Pareto-optimal configurations calculated in isolation
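
One way to turn these three rules into code is to group parameters that lie on a common cycle and then process the groups in dependency order. The sketch below uses Tarjan's strongly-connected-components algorithm for that grouping, which is an assumption on my part and not something the talk names; the four-node graph is hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to the nodes it points to."""
    index, low, on_stack, stack, comps = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return comps

# Hypothetical graph: A and B form a cycle, B feeds C, D is isolated.
graph = {"A": ["B"], "B": ["A", "C"], "C": [], "D": []}
comps = strongly_connected_components(graph)
order = comps[::-1]  # predecessors first
print(order)
```

Tarjan's algorithm emits components in reverse topological order, so reversing the result yields an order in which the cycle {A, B} is explored together and before C, while D is handled in isolation.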

33 Exploration
[Figure: dependency graph over the 26 parameters A through Z]

34 Exploration
Dependency graph:
– Based on designer knowledge
– Computed by simulating all pairs of nodes (quadratic time complexity, approx.)
– One time effort
[Figure: dependency graph over the 26 parameters A through Z]

35 Exploration – Algorithm
Step 1: Clustering followed by simulation
[Figure: dependency graph over the 26 parameters A through Z, grouped into clusters]

36 Exploration – Algorithm
Step 2: Pair-wise merge followed by simulation
Initial clusters: {A,H,I} {B,C,D,E,F,G} {J,K,T,U} {L,M,P,Q} {N,O,V,W} {X,Y,R,S} {Z}
After merge 1: {A,H,I,B,C,D,E,F,G} {J,K,T,U,Z} {L,M,P,Q,N,O,V,W} {X,Y,R,S}
After merge 2: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z} {L,M,P,Q,N,O,V,W,X,Y,R,S}
After merge 3: {A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S}
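
The merge schedule on this slide can be reproduced by pairing neighbouring clusters until one remains. This sketch only tracks which parameters end up together; in the talk each merge is followed by simulation, which is omitted here. The name `merge_rounds` and the ordering of the initial list (chosen so that neighbouring pairs match the slide) are assumptions.

```python
def merge_rounds(clusters):
    """Merge neighbouring clusters pairwise until one remains; return every level."""
    levels = [clusters]
    while len(clusters) > 1:
        clusters = [
            clusters[i] + (clusters[i + 1] if i + 1 < len(clusters) else [])
            for i in range(0, len(clusters), 2)
        ]
        levels.append(clusters)
    return levels

# The seven clusters from the slide, ordered so that neighbours are merged.
initial = [list("AHI"), list("BCDEFG"), list("JKTU"), list("Z"),
           list("LMPQ"), list("NOVW"), list("XYRS")]
levels = merge_rounds(initial)
print([len(level) for level in levels])  # [7, 4, 2, 1]
```

Seven clusters collapse to one in three rounds, so the number of merge steps grows only logarithmically with the number of clusters.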

37 Exploration
Exhaustive solution:
– Evaluate all points
– Sort by decreasing execution time
– Walk through the space, eliminate points with power > minimum seen so far!
Substitute heuristics (only works for 1-4 parameters!)
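
The exhaustive step amounts to a sort-and-sweep Pareto filter. This is a minimal sketch, assuming every configuration has already been evaluated to an (execution time, power) pair where lower is better for both; `pareto_front` and the sample numbers are hypothetical. It sorts by increasing execution time so that the running minimum power always belongs to a faster point, which is equivalent to a decreasing sort walked from its fast end.

```python
def pareto_front(points):
    """Keep points not dominated in (execution time, power); lower is better for both."""
    front = []
    best_power = float("inf")
    # Ascending execution time, ties broken by ascending power.
    for time, power in sorted(points):
        # A point survives only if it beats the lowest power of all faster points.
        if power < best_power:
            front.append((time, power))
            best_power = power
    return front

points = [(10, 5.0), (8, 7.0), (12, 6.0), (6, 9.0), (9, 8.0)]
print(pareto_front(points))  # [(6, 9.0), (8, 7.0), (10, 5.0)]
```

Here (9, 8.0) is dropped because (8, 7.0) is both faster and lower power, and (12, 6.0) is dropped because of (10, 5.0). Points that tie on power with a faster point are also dropped in this sketch.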

38 Exploration
Complexity: $O((K + \log K) \cdot 2^{N/K})$
– K is the number of clusters
– N is the number of parameters
– $2^{N/K}$ bounds the exhaustive computation
– $(K + \log K)$ bounds the number of iterations
Worst case K=1, best case K=N
$2^{N/K}$ decreases rapidly as K increases (e.g., $2^{26/2}$ is much smaller than $2^{26}$!)
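
Plugging numbers into the bound shows how quickly it shrinks with the number of clusters. The sketch assumes a base-2 logarithm and N = 26 parameters; the function name `bound` is illustrative, and the values are bounds on work, not measured run times.

```python
from math import log2

N = 26  # number of parameters in the target architecture

def bound(k, n=N):
    """Evaluate (K + log2 K) * 2^(N/K) for k clusters of equal size."""
    return (k + log2(k)) * 2 ** (n / k)

for k in (1, 2, 13, 26):
    print(k, round(bound(k)))
# K=1 gives 2^26 = 67,108,864; K=2 already drops to 3 * 2^13 = 24,576.
```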

40 Exploration – Results: JPEG
Exploration time: 29.1 min
Config. visited: (141)
5.10x exe. time, 7.51x power, 2.73x energy
Pruning ratio >

41 Exploration – Results: CKEY
Exploration time: 108 min
Config. visited: (223)
8.31x exe. time, 6.08x power, 2.57x energy
Pruning ratio >

42 Exploration – Results: IMAGE
Exploration time: 50.2 min
Config. visited: (80)
8.29x exe. time, 8.57x power, 1.81x energy
Pruning ratio >

43 Exploration – Results: MATRIX
Exploration time: 73.6 min
Config. visited: (84)
10.7x exe. time, 8.16x power, 3.18x energy
Pruning ratio >

44 Exploration – Results JPEG

45 Conclusion
Gave a system-level algorithm for exploring the solution space of an application mapped to a parameterized SOC architecture
– Given a dependency graph we extensively prune the solution space
– Pruning ratio > in experiments
Future work:
– Automatically compute the dependency model
– Replace the exhaustive sub-algorithm with a heuristic (e.g., gradient search, GA)