High Performance Embedded Computing © 2007 Elsevier Lecture 15: Embedded Multiprocessor Architectures Embedded Computing Systems Mikko Lipasti, adapted.


High Performance Embedded Computing © 2007 Elsevier
Lecture 15: Embedded Multiprocessor Architectures
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf

© 2006 Elsevier
Topics
Overview and motivation.
Embedded multiprocessor design techniques.
Embedded multiprocessor architectures.
Processing elements.

Generic multiprocessors
[Figure: two block diagrams, one labeled "Shared memory" and one labeled "Message passing", each showing PEs and memories (mem) joined by an interconnect network.]
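The difference between the two organizations can be sketched in a few lines. This is only a model, not code from the lecture: Python threads stand in for PEs, a locked variable stands in for shared memory, and a `queue.Queue` stands in for the interconnect network. The function names are hypothetical.

```python
import queue
import threading


def shared_memory_sum(values, num_pes=4):
    """PEs communicate by updating one location that all of them can see."""
    total = [0]                  # the shared memory location
    lock = threading.Lock()      # serializes access to the shared location

    def pe(chunk):
        partial = sum(chunk)     # local computation
        with lock:
            total[0] += partial  # communicate through shared memory

    threads = [threading.Thread(target=pe, args=(values[i::num_pes],))
               for i in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(values, num_pes=4):
    """PEs keep private data and communicate only by sending messages."""
    channel = queue.Queue()      # stand-in for the interconnect network

    def pe(chunk):
        channel.put(sum(chunk))  # send the partial result as a message

    threads = [threading.Thread(target=pe, args=(values[i::num_pes],))
               for i in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver combines one message per PE.
    return sum(channel.get() for _ in range(num_pes))
```

Both functions compute the same result; what changes is whether the PEs coordinate through a protected shared location or through explicit messages.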

Design choices
Processing elements:
  • Number.
  • Type.
  • Homogeneous or heterogeneous.
Memory:
  • Size and configuration.
  • Shared or private memories.
Interconnection networks:
  • Topology.
  • Protocol.

Why embedded multiprocessors?
Real-time performance: segregate tasks to improve predictability and performance.
Low power/energy: segregate tasks to allow idling, segregate memory traffic.
Cost: several small processors may be more efficient than one large processor.

Example: cell phones
Variety of tasks:
  • Error detection and correction.
  • Voice compression/decompression.
  • Protocol processing.
  • Position sensing.
  • Music.
  • Cameras.
  • Web browsing.

Example: video compression
QCIF (176 x 144) used in cell phones and portable devices:
  • 11 x 9 macroblocks of 16 x 16.
  • Frame rate of 15 or 30 frames/sec.
  • Seven correlations per macroblock = 177,408 pixel comparisons per frame.
  • Feig/Winograd DCT algorithm uses 94 multiplications and 454 additions per 8 x 8 2D DCT.
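The per-frame figure on this slide can be reproduced with a short calculation. The sketch below assumes the standard 176 x 144 QCIF frame, which is what 11 x 9 macroblocks of 16 x 16 pixels implies. The DCT totals rest on a further assumption that is not on the slide: one DCT per 8 x 8 luminance block, with chrominance ignored. The function and key names are hypothetical.

```python
MACROBLOCK_SIZE = 16                  # pixels per macroblock side
FRAME_WIDTH, FRAME_HEIGHT = 176, 144  # QCIF luminance frame
CORRELATIONS_PER_MACROBLOCK = 7       # from the slide
DCT_BLOCK_SIZE = 8
DCT_MULTS, DCT_ADDS = 94, 454         # Feig/Winograd, per 8 x 8 2D DCT


def qcif_workload(frames_per_sec):
    """Return operation counts per frame and per second for QCIF video."""
    macroblocks = ((FRAME_WIDTH // MACROBLOCK_SIZE)
                   * (FRAME_HEIGHT // MACROBLOCK_SIZE))
    # Each correlation compares every pixel of a 16 x 16 macroblock.
    comparisons = (macroblocks * CORRELATIONS_PER_MACROBLOCK
                   * MACROBLOCK_SIZE ** 2)
    # Assumption: one DCT per 8 x 8 luminance block, chrominance ignored.
    dct_blocks = ((FRAME_WIDTH // DCT_BLOCK_SIZE)
                  * (FRAME_HEIGHT // DCT_BLOCK_SIZE))
    return {
        "macroblocks": macroblocks,
        "comparisons_per_frame": comparisons,
        "comparisons_per_sec": comparisons * frames_per_sec,
        "dct_mults_per_frame": dct_blocks * DCT_MULTS,
        "dct_adds_per_frame": dct_blocks * DCT_ADDS,
    }


if __name__ == "__main__":
    for fps in (15, 30):
        print(fps, qcif_workload(fps))
```

At 30 frames/sec the motion-estimation comparisons alone come to 177,408 x 30 = 5,322,240 per second, which gives a sense of why this task is a candidate for a dedicated processing element.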

Austin et al.: portable supercomputer
Next-generation workloads on portable device:
  • Speech compression.
  • Video compression and analysis.
  • High-resolution graphics.
  • High-bandwidth wireless communications.
Workload is 10,000 SPECint = 16 x 2 GHz Pentium 4.
Power budget of 75 mW.
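The gap these numbers imply is easier to see as a ratio. The following sketch uses only the three figures on the slide; the function and key names are hypothetical, and the derived quantities are simple divisions rather than results reported by Austin et al.

```python
WORKLOAD_SPECINT = 10_000   # target workload from the slide
NUM_PENTIUM4 = 16           # equivalent number of 2 GHz Pentium 4 processors
POWER_BUDGET_W = 0.075      # 75 mW


def portable_supercomputer_targets():
    """Derive per-processor and per-watt targets from the slide's figures."""
    return {
        # Performance each Pentium 4 equivalent contributes.
        "specint_per_p4": WORKLOAD_SPECINT / NUM_PENTIUM4,
        # Efficiency the whole platform must reach.
        "required_specint_per_watt": WORKLOAD_SPECINT / POWER_BUDGET_W,
        # Power available to each Pentium 4 equivalent, in milliwatts.
        "budget_mw_per_p4_equiv": POWER_BUDGET_W / NUM_PENTIUM4 * 1000,
    }


if __name__ == "__main__":
    for name, value in portable_supercomputer_targets().items():
        print(f"{name}: {value:,.4f}")
```

The budget works out to roughly 133,000 SPECint per watt, or under 5 mW for each Pentium 4's worth of performance.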

Performance trends on desktop [Aus04] © 2004 IEEE Computer Society

Energy trends on desktop [Aus04] © 2004 IEEE Computer Society

Specialization and multiprocessing
Many embedded multiprocessors are heterogeneous:
  • Processing elements.
  • Interconnect.
  • Memory.
Why use heterogeneous multiprocessors?
  • Some operations (8 x 8 DCT) are standardized.
  • Some operations are specialized.
  • High-throughput operations may require specialized units.
Heterogeneity reduces power consumption.
Heterogeneity improves real-time performance.

Multiprocessor design methodologies
Analyze workload that represents system’s usage.
  • May include multiple programs.
Platform-independent optimizations eliminate side effects due to reference software implementation.
Platform design is based on operations, memory, etc.
Software can be further optimized to take advantage of platform.

Cai and Gajski modeling levels
Implementation: corresponds directly to hardware.
Cycle-accurate computation: captures accurate computation times, approximate communication times.
Time-accurate communication: captures communication times accurately but computation times only approximately.
Bus-transaction: models bus operations but is not cycle-accurate.
PE-assembly: communication is untimed, PE execution is approximately timed.
Specification: functional model.
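The six levels differ mainly in how precisely they time computation and communication, so they can be laid out as a small lookup table. The sketch below is one reading of the slide, not Cai and Gajski's own notation: the labels "accurate", "approximate", and "untimed" are chosen here, the implementation and specification rows are interpretations of "corresponds directly to hardware" and "functional model", and `None` marks a detail the slide does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelLevel:
    name: str
    computation: Optional[str]    # timing detail of PE execution
    communication: Optional[str]  # timing detail of communication


# Ordered from most to least detailed, following the slide.
LEVELS = [
    # Interpretation: a model that maps directly to hardware times both.
    ModelLevel("implementation", "accurate", "accurate"),
    ModelLevel("cycle-accurate computation", "accurate", "approximate"),
    ModelLevel("time-accurate communication", "approximate", "accurate"),
    # The slide says only that bus operations are modeled without cycle
    # accuracy; computation timing is not stated, so it is left as None.
    ModelLevel("bus-transaction", None, "approximate"),
    ModelLevel("PE-assembly", "approximate", "untimed"),
    # Interpretation: a purely functional model carries no timing.
    ModelLevel("specification", "untimed", "untimed"),
]


def levels_where(computation: Optional[str] = None,
                 communication: Optional[str] = None) -> List[str]:
    """Return names of levels matching the requested timing detail."""
    return [
        level.name for level in LEVELS
        if (computation is None or level.computation == computation)
        and (communication is None or level.communication == communication)
    ]


if __name__ == "__main__":
    print(levels_where(communication="untimed"))
    print(levels_where(computation="accurate"))
```

A designer who needs exact communication timing but can tolerate estimated computation would look for `levels_where(computation="approximate", communication="accurate")`, which selects the time-accurate communication level.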

Multiprocessor systems-on-chips
MPSoC is a complete platform for an application.
  • Platform is usually tailored for a particular application domain.
Generally heterogeneous processing elements.
Combine off-chip bulk memory with on-chip specialized memory.

Qualcomm MSM5100
Cell phone system-on-chip.
Two CDMA standards, analog cell phone standard (AMPS).
GPS, Bluetooth, music, mass storage.

Qualcomm MSM5100 Integration

Philips Viper Nexperia

Viper Nexperia characteristics
Designed to decode 1920 x 1080 HDTV.
Trimedia runs video processing functions.
MIPS runs operating system.
Synchronous DRAM interface for bulk storage.
Variety of I/O devices.
Accelerators: image composition, scaler, MPEG-2 decoder, video input processors, etc.

Lucent Daytona
MIMD for signal processing apps.
Processing element is based on SPARC V8.
  • DSP extensions.
Reduced precision vector unit has 16 x 64-bit vector register file.
Reconfigurable 8KB level 1 cache.
  • 16 banks configured as I-cache, D-cache, or scratchpad.
Daytona split transaction bus.
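To make the reconfigurable cache concrete, the sketch below models how 16 banks might be divided among the three roles. It is an illustration rather than Daytona's actual configuration interface: it assumes the banks are equal in size (8 KB / 16 = 512 bytes each) and that every bank is assigned a role, neither of which the slide states. The function name `configure_banks` is hypothetical.

```python
CACHE_BYTES = 8 * 1024   # reconfigurable level 1 cache, from the slide
NUM_BANKS = 16           # from the slide
# Assumption: all banks are the same size.
BANK_BYTES = CACHE_BYTES // NUM_BANKS


def configure_banks(icache, dcache, scratchpad):
    """Return bytes given to each role for a split of the 16 banks."""
    counts = {"icache": icache, "dcache": dcache, "scratchpad": scratchpad}
    if any(n < 0 for n in counts.values()):
        raise ValueError("bank counts must be non-negative")
    if sum(counts.values()) != NUM_BANKS:
        # Assumption: every bank must be assigned to exactly one role.
        raise ValueError(f"bank counts must add up to {NUM_BANKS}")
    return {role: n * BANK_BYTES for role, n in counts.items()}


if __name__ == "__main__":
    # A signal-processing kernel with a small loop body might favor
    # scratchpad space over instruction cache.
    print(configure_banks(icache=2, dcache=4, scratchpad=10))
```

The trade-off a configuration expresses is between hardware-managed caching, which needs no software effort, and a scratchpad, whose contents and access time are under software control.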

Lucent Daytona PE
SPARC V8 core.
  • 5-stage pipeline.
  • Windowed register file: eight 16-entry register windows plus 16 global registers.

STMicro Nomadik
Designed for mobile multimedia.
Accelerators built around MMDSP+ core:
  • One instruction per cycle.
  • 16- and 24-bit fixed-point, 32-bit floating-point.

STMicro Nomadik accelerators
[Figure: video and audio accelerators.]

TI OMAP
Designed for mobile multimedia.
C55x DSP performs signal processing as slave.
ARM runs operating system, dispatches tasks to DSP.
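The division of labor described here, with one processor dispatching work and the other executing it, can be sketched as a master/worker pair. This is only a behavioral model: Python threads and queues stand in for the ARM, the DSP, and whatever inter-processor mechanism the chip provides, and the sum-of-squares "kernel" is a placeholder for real signal processing. All names are hypothetical.

```python
import queue
import threading


def dsp_worker(tasks, results):
    """Model of the DSP: wait for a task, process it, report the result."""
    while True:
        item = tasks.get()
        if item is None:          # sentinel from the master: shut down
            break
        task_id, samples = item
        # Placeholder kernel: signal energy as a sum of squares.
        results.put((task_id, sum(s * s for s in samples)))


def run_master(jobs):
    """Model of the ARM: dispatch every job to the DSP, collect results."""
    tasks, results = queue.Queue(), queue.Queue()
    dsp = threading.Thread(target=dsp_worker, args=(tasks, results))
    dsp.start()
    for task_id, samples in jobs.items():
        tasks.put((task_id, samples))   # dispatch a task
    tasks.put(None)                     # no more work
    dsp.join()
    # One result was queued per job, so this loop does not block.
    return dict(results.get() for _ in range(len(jobs)))


if __name__ == "__main__":
    print(run_master({"frame0": [1, 2, 3], "frame1": [4]}))
```

The master never touches the samples itself; it only decides what runs and when, which mirrors the slide's split between operating-system duties and signal processing.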

TI OMAP 5912

Processing elements issues
How many do we need?
What types of processing elements do we need?
Analyze performance/power requirements of each process in the application.
Choose a processor type for each process.
Determine what processes should share processing elements.
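The last three steps can be illustrated with a toy allocator. The sketch below assumes each process has already been analyzed and given a processor type and a utilization (the fraction of one PE it needs). It then packs processes of the same type onto shared PEs with a first-fit-decreasing heuristic. The heuristic, the utilization figures, and the process names are all illustrative assumptions, not a method from the lecture.

```python
from collections import defaultdict


def allocate(processes, capacity=1.0):
    """Pack processes onto PEs of their chosen type.

    processes: list of (name, pe_type, utilization) tuples.
    Returns {pe_type: [[process names sharing PE 0], [PE 1], ...]}.
    """
    pes = defaultdict(list)   # pe_type -> list of {"load", "procs"}
    # Place the most demanding processes first (first-fit decreasing).
    for name, pe_type, util in sorted(processes, key=lambda p: -p[2]):
        if util > capacity:
            raise ValueError(f"{name} does not fit on a single PE")
        for pe in pes[pe_type]:
            if pe["load"] + util <= capacity + 1e-9:
                pe["load"] += util          # share an existing PE
                pe["procs"].append(name)
                break
        else:
            # No existing PE of this type has room: add another one.
            pes[pe_type].append({"load": util, "procs": [name]})
    return {t: [pe["procs"] for pe in group] for t, group in pes.items()}


if __name__ == "__main__":
    # Hypothetical cell-phone workload; the numbers are made up.
    workload = [
        ("voice_codec", "dsp", 0.6),
        ("error_correction", "dsp", 0.5),
        ("protocol", "risc", 0.3),
        ("ui", "risc", 0.4),
    ]
    print(allocate(workload))
```

With these made-up numbers the two DSP processes cannot share a PE (0.6 + 0.5 exceeds 1.0), while the two RISC processes can, so the answer to "how many do we need?" is two DSPs and one RISC processor.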

Embedded Multiprocessor Questions
Of the embedded multiprocessors we discussed in this lecture, which one seemed
  • The most general purpose? Why?
  • The most application-specific? Why?
What are advantages and disadvantages of the configurable cache used in the Lucent Daytona architecture?
What benefits do the accelerators in the Viper Nexperia processor provide?
For what types of applications are accelerators most important?