Adaptive System on a Chip (ASOC): A Backbone for Power-Aware Signal Processing Cores. Andrew Laffely, Jian Liang, Russ Tessier and Wayne Burleson, UMASS.

Presentation transcript:

Burleson, UMASS 1 Adaptive System on a Chip (ASOC): A Backbone for Power-Aware Signal Processing Cores
Andrew Laffely, Jian Liang, Russ Tessier and Wayne Burleson
Electrical and Computer Engineering, University of Massachusetts Amherst
This material is based upon work supported by the National Science Foundation under Grant No and SRC Tasks 766 and 1075

Burleson/UMASS 2 Challenges in Media Processing
- Increasingly complex, heterogeneous algorithms
- Variable run-times (e.g. data-dependent iterations)
- Variable quality
- Variable power consumption
- Large data-sets, usually streaming
- Memory size, ports and latency issues
- Advancing semiconductor technology (Moore's Law)
- Interconnect (on-chip and I/O)
- Clocking
- Power (consumption and distribution)
- Design and Verification

Burleson/UMASS 3 aSoC: adaptive System on a Chip
- Tiled SoC architecture
[Figure: tile array of cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]

Burleson/UMASS 4 aSoC: adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
  - Pick and place cores which best perform the given application
  - Increase performance
  - Save power
  - Cores may be any number of tiles in size
[Figure: tile array of cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]

Burleson/UMASS 5 aSoC: adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
- Connected with an interconnect mesh
  - Restricted to near neighbor communications
  - Creates pipeline
  - Decreases cycle time
[Figure: tile array of cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]

Burleson/UMASS 6 aSoC: adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
- Connected with an optimized fixed interconnect mesh
- Using a communication interface (CI) to manage data
  - Network port (Coreport) for each core, I/O queues, handshake
  - Each CI uses a memory and FSM to repetitively process a predefined (static) schedule of communications
  - High-speed 5x5 bidirectional crossbar
[Figure: tile array of cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]
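The static schedule described on this slide can be illustrated with a small software model. This is a minimal Python sketch and not the authors' hardware design: the class name, the five port names (four neighbors plus the core, chosen to match a 5x5 crossbar), and the dictionary encoding of each schedule slot are all assumptions made for illustration.

```python
# Hypothetical software model of a communication interface (CI).
# Port names and the schedule encoding are illustrative assumptions.
PORTS = ("north", "south", "east", "west", "core")


class CommunicationInterface:
    """Cycles through a fixed list of crossbar settings, one per clock."""

    def __init__(self, schedule):
        if not schedule:
            raise ValueError("schedule must contain at least one slot")
        for slot in schedule:
            for src, dst in slot.items():
                if src not in PORTS or dst not in PORTS:
                    raise ValueError(f"unknown port in route {src}->{dst}")
            # A crossbar output can be driven by only one input per cycle.
            if len(set(slot.values())) != len(slot):
                raise ValueError("two inputs routed to the same output")
        self.schedule = list(schedule)  # plays the role of instruction memory
        self.pc = 0                     # program counter into the schedule

    def step(self, inputs):
        """Route this cycle's input words, then advance the schedule."""
        slot = self.schedule[self.pc]
        outputs = {dst: inputs[src] for src, dst in slot.items() if src in inputs}
        self.pc = (self.pc + 1) % len(self.schedule)  # repeat forever
        return outputs


# Two-slot schedule: deliver a word to the core, then forward its result.
ci = CommunicationInterface([
    {"north": "core"},
    {"core": "east", "west": "south"},
])
first = ci.step({"north": 7})
second = ci.step({"core": 9, "west": 3})
```

Because the schedule is fixed ahead of time, no routing decision is made at run time: the model only indexes the next slot, which is the property the slide attributes to the memory-plus-FSM design.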

Burleson/UMASS 7 Communication Interface
- Custom design to maximize speed and reduce power
- Core-ports
- Crossbar
- Controller
- Instruction memory
- Local frequency and voltage supply
[Figure: CI block diagram; labels: Core, Core-ports, Decoder, Local Frequency & Voltage, Instruction Memory, PC, Controller, Local Config., Crossbar with North/South/East/West inputs and outputs, example route North to South & East]

Burleson/UMASS 8 aSoC Implementation and Integration
- TSMC technology
- Full custom

Burleson/UMASS 9 Research Thrusts
- aSoC Infrastructure [1,3]: Communication Interface, Interconnect [3], Power Distribution, Clock, System Power Management
- Design Technology: Compiler [1,3] (Partitioner, Mapper, Placer, Scheduler), Simulator [1]
- Cores: Motion estimation [2,3], Discrete Cosine Transform [2,3], AES Cryptography [3], Huffman Coding, Adaptive Viterbi [2,3], 3D Graphics [1,2,3], Smart Card [2,3], MP3, ARM, DSP, Cache [2,3], FPGA, MAC
Key: [1] PhD Dissertation, [2] Masters Thesis, [3] Publications

Burleson/UMASS 10 Voltage Scaling Approach
- Core-ports
  - Single buffer for each stream to cross clock/voltage barrier between core and interface
  - Reading/Writing success rates indicate core utilization
  - Input blocked: Core too slow
  - Output blocked: Core too fast
- Controller
  - Interprets core-port success rates to adjust local clock and voltage
[Figure: processing pipeline; labels: Interconnect, Buffer, Input Core-port, Output Core-port, Core, Clock and Supply Controller, Local Vdd, Local Clock, Blocked]
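The control rule on this slide (input blocked means the core is too slow, output blocked means it is too fast) can be sketched as a simple step-up/step-down policy. This is only an illustrative Python sketch: the function name, the 0.1 threshold, and the pairing of clock fractions with the supply voltages that appear elsewhere in the deck are assumptions, not the authors' controller.

```python
# Operating points as (relative clock, Vdd in volts), slowest first.
# The pairing of speeds with voltages is an assumption for illustration.
LEVELS = [(0.125, 0.6), (0.25, 0.73), (0.5, 1.16), (1.0, 1.8)]


def next_level(level, input_blocked_rate, output_blocked_rate, threshold=0.1):
    """Return the index into LEVELS to use for the next interval.

    input_blocked_rate:  fraction of cycles the input core-port was full
                         (the core is consuming too slowly).
    output_blocked_rate: fraction of cycles the output core-port was full
                         (the core is producing too quickly).
    """
    if input_blocked_rate > threshold and input_blocked_rate >= output_blocked_rate:
        return min(level + 1, len(LEVELS) - 1)  # speed up, raise Vdd
    if output_blocked_rate > threshold:
        return max(level - 1, 0)                # slow down, lower Vdd
    return level                                # balanced: hold


level = 1
level = next_level(level, input_blocked_rate=0.5, output_blocked_rate=0.0)
clock, vdd = LEVELS[level]
```

Moving one level at a time keeps the policy simple; a real controller would also need hysteresis and a settling time after each voltage change, which this sketch omits.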

Burleson/UMASS 11 Vdd Selection Criteria
- As Vdd decreases, delay increases exponentially
- Use curve to match available clock frequencies to voltages
- The voltage and frequency change reduces power by 79%, 96%, and 98.7%
- P = α C (Vdd)^2 f
[Figure: Normalized Core Critical Path Delay vs. Vdd; axes: Voltage, Normalized Delay; marked speeds: Max Speed, 1/2 Speed, 1/4 Speed, 1/8 Speed; marked voltages: 1.16, 0.73]
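The quoted savings can be checked against the dynamic power relation on this slide, since power scales with the square of Vdd times the clock frequency when switched capacitance and activity are held constant. The sketch below is a worked check in Python under stated assumptions: the 1.8 V full-speed supply and the 0.6 V value for 1/8 speed are taken from the Power Distribution slide later in the deck, and the pairing of each voltage with a clock fraction is assumed.

```python
def relative_power(vdd, freq, vdd_max=1.8, freq_max=1.0):
    """Dynamic power relative to the full-speed point, from P ~ C * Vdd^2 * f."""
    return (vdd / vdd_max) ** 2 * (freq / freq_max)


# (relative clock, Vdd) pairs; the pairing is an assumption.
points = [(0.5, 1.16), (0.25, 0.73), (0.125, 0.6)]
savings = [1.0 - relative_power(vdd, speed) for speed, vdd in points]

for (speed, vdd), saving in zip(points, savings):
    print(f"{speed}x clock at {vdd} V: about {saving:.1%} less power")
```

Under these assumptions the three results come out close to the 79%, 96% and 98.7% quoted on the slide; the last one lands slightly lower, at roughly 98.6%.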

Burleson/UMASS 12 Architecture Evaluation (Motion Estimation)
- Array-based architecture
- Pipelined ME
- Parameterized search window size
  - Full search
  - Choose 16x16 or 8x8 windows
  - Reduce power
[Figure: block diagram; labels: Address Generation Unit, Processing Element Array, Memory, FIFOs]

Burleson/UMASS 13 Power Aware Core
- Custom motion estimation core
- Choose search method
  - Full search mW (bit width and pel sub-sampling)
  - Spiral search 76 mW
  - Three step search 25 mW
- Data taken with Synopsys(TM) Power Compiler at the RTL level
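The slide names the search methods without describing them. For readers unfamiliar with three step search, the following is a minimal Python sketch of the common textbook form of the algorithm, not the authors' core: it evaluates nine candidates around the current best vector, halves the step, and repeats. The function names, the sum-of-absolute-differences cost, and the default block size and starting step are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def three_step_search(cur, ref, bx, by, n=8, step=4):
    """Return ((dx, dy), cost) for the block at (bx, by).

    With step=4 the search visits steps 4, 2, 1, so it covers +/-7 pixels
    using at most 25 cost evaluations instead of 225 for a full search.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        cx, cy = best
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                mx, my = cx + dx, cy + dy
                # Skip candidates whose block would fall outside the frame.
                if not (0 <= bx + mx <= width - n and 0 <= by + my <= height - n):
                    continue
                cost = sad(cur, ref, bx, by, mx, my, n)
                if cost < best_cost:
                    best, best_cost = (mx, my), cost
        step //= 2
    return best, best_cost


# Identical frames: the zero vector already has zero cost.
frame = [[(7 * x + 13 * y) % 32 for x in range(32)] for y in range(32)]
vector, cost = three_step_search(frame, frame, bx=12, by=12)
```

The reduced number of candidate evaluations is the usual reason a three step search costs less than a full search, at the risk of settling on a local minimum; this is consistent with the power/quality tradeoff the deck describes, though the slide itself does not break the figures down.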

Burleson/UMASS 14 aSoC Support
- Multiple streams in and out through dedicated core ports
- Easy to manage on both sides of the port
- Schedule configuration streams in with the data
  - Stream A: Input Frame
  - Stream B: Configuration (Choose search mode and size)
  - Stream C: Motion Vectors
[Figure: Motion Estimation Core with coreports in1, in2, out1, out2 carrying Stream A, Stream B, Stream C]

Burleson/UMASS 15 Reconfigurable Interconnect
[Figure: P-frame and I-frame dataflows; labels: ME, MC, -, +, Input Frame, DCT]

Burleson/UMASS 16 aSoC Support
- Lumped ME, MC and Summation into one double core
[Figure: tiles: DCT, Motion Estimation & Compensation]

Burleson/UMASS 17 aSoC Support: P-Frame
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A), Difference Frame (Stream B)]

Burleson/UMASS 18 aSoC Support: Schedule Change
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A), Difference Frame (Stream B), Configuration Streams (C & D)]

Burleson/UMASS 19 aSoC Support: Schedule Change
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A), Difference Frame (Stream B), Configuration (Stream C); PC selecting between Schedule 1 and Schedule 2]

Burleson/UMASS 20 aSoC Support: Schedule Change
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A), Difference Frame (Stream B), Configuration (Stream C); PC selecting between Schedule 1 and Schedule 2]

Burleson/UMASS 21 aSoC Support: Schedule Change
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A), Configuration (Stream D); PC selecting between Schedule 1 and Schedule 2]

Burleson/UMASS 22 aSoC Support: Schedule Change
[Figure: tiles: DCT, Motion Estimation & Compensation; streams: Input Frame (Stream A'), Configuration (Stream D); PC selecting between Schedule 1 and Schedule 2]

Burleson/UMASS 23 aSoC Support: I-Frame
[Figure: tiles: DCT, Motion Estimation & Compensation (OFF); stream: Input Frame (Stream A')]

Burleson/UMASS 24 Operating Frequency?
- Interconnect synchronized, H-tree clock distribution
- Core frequencies depend on critical path
- Tile provides clock reference
- Coreport provides asynchronous boundary
- Dynamic core configuration requires dynamic clock configuration
  - aSoC clock reference provides multiples of interconnect clock (… 4x, 2x, 1x, 0.5x, 0.25x, …)
  - Configured through the tile controller

Burleson/UMASS 25 Clock Distribution

64 tile aSoC   70nm        100nm       130nm       180nm
Chip Area      (9.24mm)^2  (13.3mm)^2  (17.2mm)^2  (23.8mm)^2
Frequency      5 GHz       2 GHz       1 GHz       0.5 GHz
Power          126 mW      240 mW      445 mW      784 mW
Mean Skew      41 ps       50 ps       92 ps       70.6 ps
Percent Skew   21 %        10 %        9 %         4 %

- Tiled architecture extends life of globally synchronous systems
- Precise H-tree implementation
- Load is small and equal at each branch
- Skew can be reduced by 70% with advanced deskew circuits [1]
[1] S. Tam et al., "Clock Generation and Distribution for the First IA-64 Microprocessor," IEEE JSSC, Nov. 2000
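The Percent Skew row appears to express mean skew as a fraction of the clock period. The slide does not state this definition, so the Python sketch below treats it as an assumption and simply checks that it reproduces the table to within rounding.

```python
# (technology node, frequency in GHz, mean skew in ps, percent skew on slide)
ROWS = [
    ("70nm", 5.0, 41.0, 21),
    ("100nm", 2.0, 50.0, 10),
    ("130nm", 1.0, 92.0, 9),
    ("180nm", 0.5, 70.6, 4),
]


def percent_skew(freq_ghz, skew_ps):
    """Mean skew as a percentage of the clock period (assumed definition)."""
    period_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps period
    return 100.0 * skew_ps / period_ps


computed = {node: percent_skew(f, s) for node, f, s, _ in ROWS}
for node, f, s, quoted in ROWS:
    print(f"{node}: computed {computed[node]:.1f} %, slide says {quoted} %")
```

Read this way, the table shows the absolute skew staying within the same order of magnitude across nodes while the period shrinks, so skew consumes a growing share of the cycle at higher frequencies.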

Burleson/UMASS 26 Mixed vs. Fixed Core Frequencies
- Cores not designed with clock gating
- Core power from Synopsys RTL simulation
- Interconnect from SPICE
- Assumes 10 cycle schedule, 4 pixels/word

Burleson/UMASS 27 Current Density and Clocking
- Red: fixed worst case clocking
  - Short spikes of high current
- Green: optimal independent clocking
  - Slow and low
- Optimal clocking eliminates current spikes (also improved battery life)
[Figure: Current vs. Time between Process Start and Deadline; traces: ME: Full Search, ME: Spiral, ME: Three Step Search, DCT]

Burleson/UMASS 28 Power Distribution

64 tile aSoC       Vh       Vmh      Vml      Vl
Voltage            1.8V     1.16V    0.73V    0.6V
Current per Core   110mA    25mA     13mA     7mA
Total Power        12.1 W   1.86 W   607 mW   269 mW

- Heterogeneous power-aware cores require multiple power supply voltages
- Tile structure enables uniform interwoven grid
- Larger grid for higher current demands
  - Reduced resistance
  - Higher capacitance
[Figure: interwoven power grid; rails: Gnd, Vh, Vl, Vml, Vmh]
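The Total Power row is consistent with multiplying supply voltage by per-core current and by the 64 tiles, assuming one core per tile. That relationship is not stated on the slide, so the Python sketch below is only a consistency check under that assumption.

```python
TILES = 64  # from the "64 tile aSoC" table heading

# (rail name, voltage in V, current per core in mA, total power on slide in W)
RAILS = [
    ("Vh", 1.8, 110.0, 12.1),
    ("Vmh", 1.16, 25.0, 1.86),
    ("Vml", 0.73, 13.0, 0.607),
    ("Vl", 0.6, 7.0, 0.269),
]


def total_power_w(voltage, current_ma, tiles=TILES):
    """Total power in watts if every tile's core draws current_ma at voltage."""
    return voltage * (current_ma / 1000.0) * tiles


totals = {name: total_power_w(v, i) for name, v, i, _ in RAILS}
for name, v, i, quoted in RAILS:
    print(f"{name}: computed {totals[name]:.3f} W, slide says {quoted} W")
```

The three lower rails match the slide to the precision shown. The Vh column computes to about 12.7 W against the quoted 12.1 W; the slide does not explain the difference.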

Burleson/UMASS 29 Advanced Signaling Techniques (building on SRC-funded work)
- Differential current sensing
- Booster Insertion
- Multi-level current signaling
- Phase coding

Burleson/UMASS 30 Interconnect Characterization
- Comparing delay and power of signaling techniques for different tile sizes at 250nm, 180nm, 130nm, 100nm (available via the web-based tool Network on Chip Interconnect Calculator, NOCIC)

Burleson/UMASS 31 Conclusions
- Regular Tiled Architecture
  - Task-based parallelism using heterogeneous cores
  - Predictable interconnect
  - Regular core interface, Vdd and clock control, and configuration control
- Static scheduling
  - High-level global schedule of inter-core communication
  - Accommodates dynamic workloads with queues and local handshakes
- Demonstration using Motion Estimation and DCT
  - Variable search window and search algorithm provide power/quality tradeoff
- Power savings using scalable approaches to dynamic clock and power variation
  - Simple clock dividers leveraging existing clock distribution methods
  - Route multiple power supplies to allow rapid switching and avoid overhead of on-chip power regulation

Burleson/UMASS 32 Ongoing Work
- Satellite Set-top Box application
  - Developed at Hughes Networks using 7 distinct RISC cores
  - Compare ASOC with in-house shared memory approach for interconnections
- New and more complete wireless and multimedia systems
  - JPEG2000, MPEG-4, 3D Graphics, …
- ASOC parameter optimization
  - Tile sizes, bus widths, clocks, VDDs
- Coping with Core irregularity
  - Size, I/O positions, shapes, bus widths, communication interfaces
- Interconnect circuit optimization (NoCIC)
- Leakage Power issues
- Reliability, Test, Fault-Tolerance and Security
- Compilation: especially Partitioning, Mapping
- Prototypes: 0.18u MOSIS of communication interface, ~25K transistors, verification of interface logic and timing
- ASOC in Education: Circuits, architecture and core design projects

Burleson/UMASS 33 Implications (perhaps controversial)
- Multi-core architectures will be needed to maintain Moore's law (interconnect, memory, parallelism)
- Task-based parallelism may be easier to program, extract and implement than data parallelism (think multi-core rather than instruction level parallelism)
- Global coarse synchronization provides an approach to hard real-time computing for dynamic workloads (such as video coding)
- Dynamic power savings exploiting fine-grain workload variations can be achieved through straightforward clock and power scaling methods
- Interconnect standards will be specified by silicon foundries, similar to cell libraries and memories

Burleson/UMASS 34 Design Flow
- Architecture to Layout
  - Architecture: Block diagram of system and behavioral description
  - Logic: Gate level or schematic description
  - Circuit: Transistor configurations and sizings
  - Layout: Floorplanning, clock and power distribution
- Tools
  - VerilogXL: behavioral representation
  - VTVT: standard cell library
  - Synopsys: standard cell gate level netlist generation
  - Silicon Ensemble: standard cell netlist to layout
  - Cadence LayoutPlus: schematic and layout design
  - NCSU CDK: design and extraction rules
  - Cadence Layout vs. Schematic: layout verification
  - HSPICE: circuit simulator