Presented by Santosh Ponnala


Multiprocessor System-on-Chip (MPSoC) Technology
Wayne Wolf, Ahmed Amine Jerraya, and Grant Martin
Presented by Santosh Ponnala

Brief Overview
• Introduction
• Multiprocessors and the Evolution of MPSoCs
• How Applications Influence Architecture
• Architectures for Real-Time, Low-Power Systems
• CAD Challenges in MPSoCs
• Conclusion

Introduction
• What is an MPSoC?
• Where are MPSoCs used?
• What are the system requirements?
• Why MPSoC?
• What is a multiprocessor?
• How is an MPSoC different from a multiprocessor?

What is a Parallel Architecture?
"A large collection of processing elements that communicate and cooperate to solve large problems fast." [Almasi and Gottlieb]
• "Collection of processing elements": How many? How powerful is each? Does it scale?
• "That can communicate": How do the PEs communicate (shared memory vs. message passing)? Over what interconnection network (bus, crossbar, ...)?
(Figure: serial computing vs. parallel computing)

Why Use Parallel Computing?
Main reasons:
• Save time and/or money
• Solve larger problems
• Provide concurrency
• Limits to serial computing

Taxonomy of Parallel Computers

Vector vs. Array Processing
Vector (pipelined) processing: let n be the size of each vector. Then the time to compute f(V1, V2) is k + (n - 1) steps, where k is the length (depth) of the pipeline for f.
Array processing: each of the operations f(v1j, v2j) on the components of the two vectors is carried out simultaneously, in one step.
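The timing formulas above can be checked with a minimal sketch (the vector length and pipeline depth are arbitrary illustrative values, not from the slides):

```python
def pipelined_vector_time(n, k):
    """Time (in steps) for a k-stage pipeline to apply f to two
    n-element vectors: k steps until the first result emerges,
    then one new result per step for the remaining n - 1."""
    return k + (n - 1)

def array_processor_time(n):
    """An array processor with n processing elements applies f to
    all n component pairs at once: one step, regardless of n."""
    return 1

print(pipelined_vector_time(64, 8))  # 71 steps
print(array_processor_time(64))      # 1 step
```

The pipelined time grows linearly in n but amortizes the fill cost k, while the array processor trades hardware (n PEs) for constant time.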

Early Multiprocessors: The ILLIAC IV
CU = control unit, PE = processing element, PEM = PE memory module.
The machine was not fully operational until 1975; between then and 1981 it was the world's fastest computer. It performed vector and array operations in parallel.
Speed of integration tracks Moore's law: doubling every 18-24 months.
Generic model of a multiprocessor: a collection of computers (CPU + memory) communicating over an interconnection network. [Culler et al.]
(Figure: architecture of the ILLIAC IV)

Why did uniprocessor performance grow so fast?
• ~half from circuit improvement (smaller transistors, faster clocks, etc.)
• ~half from architecture/organization:
  • Instruction-level parallelism (ILP): pipelining (RISC, and CISC with a RISC back end), superscalar issue, out-of-order execution
  • Memory hierarchy (caches): exploiting spatial and temporal locality; multiple cache levels

History of Multiprocessors
• 80s to early 90s: prime time for parallel-architecture research. A microprocessor could not fit on a single chip, so systems naturally needed multiple chips (and processors).
• 90s: at the low end, uniprocessor speed grows much faster than parallel-system speed. A microprocessor now fits on one chip, and so do a branch predictor, multiple functional units, large caches, etc. The microprocessor itself exploits parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors.
• 90s: emergence of distributed (as opposed to parallel) machines, driven by progress in network technology: network bandwidth grows faster than Moore's law, and fast interconnection networks become cheap. Cheap uniprocessor systems are connected into large distributed machines: networks of workstations, clusters, grids.
• 00s: parallel architectures are back. Transistors per chip far exceed what a single microprocessor can use, and it is harder to get more performance from a uniprocessor. Examples: Intel Pentium D and Core Duo, AMD dual-core, IBM POWER5, Sun Niagara.

History of MPSoCs
1. Lucent Daytona MPSoC
• Designed for wireless base stations, in which identical signal processing is performed on a number of data channels.
• Split-transaction bus.
• Processing element based on SPARC V8.
• Reconfigurable L1 cache.
• SIMD architecture.

2. C-5 Network Processor
• Application: packet processing in networks.
• Packets are handled by channel processors; each cluster has 4 processors.
• Packet processors intercept individual IP data packets and process them using application software.
• Executive processor: a RISC CPU.
• Operating frequency: 166-233 MHz.

3. Philips Viper Nexperia
• Application: multimedia processing.
• Two CPUs. Master: MIPS PR3940; slave: TriMedia TM32.
• Three buses; a memory controller for the external DRAM interface, and DMA units for each CPU.
• Can run many operating systems, including Windows CE, Linux, and VxWorks.
• The CPUs share the same resources and use semaphores to negotiate ownership of shared resources.
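The semaphore-based negotiation mentioned in the last bullet can be illustrated with a rough software analogy (all names below are hypothetical and not from any Nexperia API): each "CPU" must acquire a semaphore before touching a shared resource.

```python
import threading

# Hypothetical shared resource guarded by a binary semaphore, as a
# software analogy to the Viper CPUs negotiating ownership.
frame_buffer_lock = threading.Semaphore(1)
frame_buffer = []

def cpu_task(name, value):
    # Acquire the semaphore before touching the shared buffer;
    # the "with" block releases it automatically when done.
    with frame_buffer_lock:
        frame_buffer.append((name, value))

threads = [threading.Thread(target=cpu_task, args=(n, i))
           for i, n in enumerate(["MIPS", "TriMedia"])]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(frame_buffer))  # 2
```

On the real chip the semaphores are a hardware mechanism shared across two different CPU architectures; the point here is only the acquire/use/release protocol.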

4. TI OMAP 5912
• Application: cell-phone processor, designed to support 2.5G and 3G wireless applications.
• In addition to basic voice services, it is intended for speech processing, location-based services, security, gaming, and multimedia.
• Two CPUs: an ARM9 and a TMS320C55x digital signal processor (DSP).
• The C55x DSP performs signal processing as a slave; the ARM runs the operating system and dispatches tasks to the DSP.
• SRAM capacity: 192 KB.

5. STMicro Nomadik
• Designed for mobile multimedia.
• Host processor: ARM926EJ, with two programmable accelerators on the bus: a video accelerator and an audio accelerator.
• Accelerators are built around the MMDSP+ core: one instruction per cycle; 16- and 24-bit fixed point, 32-bit floating point.
• The video accelerator is itself a heterogeneous multiprocessor.

Moore's Law
Is it a law of physics? Of process technology? Of micro-architecture? Of psychology?
Most of us are familiar with Moore's-law growth in transistor counts; other characteristics appear to have reached a ceiling.

Design Issues
Multiprocessors face implementation-technology concerns (billion-transistor CMOS implementation technology):
• transistor gate delay
• interconnect delay
• exponential increase in processor clock rates
The result of these trends: design complexity.

UltraSPARC Niagara
• 8 CPU cores, but only a single floating-point unit
• 4 DDR2 buses
• 4-way L2 cache
• Built-in self-test
• Operates at 1.4 GHz
• Capable of processing up to 32 concurrent threads

Comparing Alternative Multiprocessor Architectures
Logic, wire, and design complexity will increasingly favor CMP over superscalar and SMT implementations.
(Figures: superscalar, SMP, and CMP organizations; parallel vs. distributed computers.)

Characteristics of Superscalar, SMT, and CMP architectures

How to Use Increasing Transistors

Year  Processor            Transistors    Feature size  Data width  Frequency  Features
1971  4004                 2,300          10,000 nm     4           740 kHz    First microprocessor
1978  8086                 29,000         3,000 nm      16          10 MHz     IBM PC/AT
1985  80386                275,000        1,000 nm      32          33 MHz     Pipelining
1989  80486                1,200,000      800 nm        -           100 MHz    Integral FPU
1993  Pentium              3,100,000      -             -           150 MHz    On-chip L1 cache; superscalar
1995  Pentium Pro          5,500,000      600 nm        -           200 MHz    Out-of-order execution
1997  Pentium MMX P55C     4,500,000      350 nm        -           450 MHz    Dynamic branch prediction; MMX (SIMD) instructions
1999  Pentium III          28,000,000     180 nm        -           1.1 GHz    On-chip L2 cache
2004  Pentium 4E           125,000,000    90 nm         -           3.8 GHz    Hyper-Threading
2006  Xeon Tulsa           167,000,000    65 nm         64          3.4 GHz    Dual-core
2010  Xeon 7500 (Nehalem)  2,300,000,000  45 nm         -           2.26 GHz   Eight cores

Multi-nonsense
• Multi-core was a solution to a performance problem
• Hardware works sequentially
• Make the hardware simple: thousands of cores
• Do it in parallel at a slower clock rate to save power
• ILP is dead
• Examine what is (rather than what can be)
• Communication: off-chip hard, on-chip easy
• Abstraction is a pure good
• Programmers are all dumb and need to be protected
• Thinking in parallel is hard

Power and Memory Considerations

Performance Improvements
Execution time per program factors as (I/P) x (C/I) x (S/C): instructions per program, cycles per instruction, and seconds per cycle.
• Computer engineers improve performance through the reduction of C/I.
• I/P is the domain of CS: writing software.
• S/C is the domain of EE/VLSI: IC fabrication.
• CPI (C/I) is improved by getting more instructions done in each cycle, i.e., doing work in parallel, distributed across the functional units of the IC.
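The factorization above can be sketched as a one-line calculation (the instruction count and clock rate below are illustrative assumptions, not figures from the slides):

```python
def execution_time(instructions, cpi, clock_hz):
    """Execution time = (I/P) * (C/I) * (S/C):
    instruction count * cycles per instruction * seconds per cycle."""
    return instructions * cpi / clock_hz

# Illustrative: a 1-billion-instruction program on a 1 GHz machine.
base = execution_time(1e9, 2.0, 1e9)    # 2.0 s
better = execution_time(1e9, 1.0, 1e9)  # 1.0 s: halving CPI halves runtime
print(base, better)
```

The sketch makes the slide's point concrete: with I/P fixed by the software and S/C fixed by the fabrication process, the architect's lever is CPI.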

How Applications Influence Architecture
Complex applications: the nature of the computation matters. E.g., in an MPEG-2 encoder, memory-bandwidth requirements vary across the block diagram.
Standards-based design:
• Many high-volume markets are standards-driven: wireless, multimedia, networking.
• The standard defines the basic I/O requirements.
• Real-time operation; low-power/low-energy operation.
• Standards committees often provide reference implementations (typically very single-threaded).

Platform-Based Design
What is a platform? A partial design:
• for a particular type of system;
• includes embedded processor(s), and may include embedded software;
• customizable to a customer's requirements, e.g., through software component changes.
Why platforms?
• Any given space has a limited number of good solutions to its basic problems.
• A platform captures the good solutions to the important design challenges in that space.
• A platform reuses architectures; standards encourage platform-based design.

Alternatives to Platforms
• General-purpose architectures (e.g., Intel): may require much more area to accomplish the same task, and are often much less energy-efficient.
• Reconfigurable systems (e.g., Xilinx): good for pieces of the system, but tough to compete with software for miscellaneous tasks.

Platform vs. Full-Custom
Platform: many fewer degrees of freedom, so it is harder to differentiate, but design characteristics can be analyzed.
Full-custom: extremely long design cycles; may require less aggressive design styles if pieces cannot be reused.
Costs of platform-based design: masks, design of the platform plus its customization, and design verification.

Platform-Based Design (reduces cost)
Divide system design into two phases:
1. design a platform for a class of applications;
2. adapt the platform for a particular product in that application space.
Homogeneous MP vs. heterogeneous MP. Platform characteristics to consider: data rate, power and energy consumption, buffering and memory management.
Product design is software-driven (customization): the usefulness of a platform depends largely on the quality and capabilities of the SDE.

Architectures for Real-Time, Low-Power Systems
Performance and power efficiency. Benchmarks: high-performance data networking, voice recognition, video compression/decompression, and other applications.
(Figure: power-consumption trends for desktop processors, from Austin et al. [Aus04], (c) 2004 IEEE Computer Society)

Architectures for Real-Time, Low-Power Systems (contd.)
Real-time performance: homogeneous vs. heterogeneous architectures.
E.g., in a shared-memory MP, software methods can eliminate conflicts.
The application's structure drives the homogeneous-vs-heterogeneous choice.

CAD Challenges in MPSoCs
1. Configurable processors and instruction-set synthesis
• CPU configuration (tools that generate an HDL description).
• Coarse-grained and fine-grained instruction extensions. E.g., MIMOLA, LISA, Tensilica Xtensa.
• Instruction-set synthesis: the 1% rule [Holmer and Despain].

CAD Challenges in MPSoCs (contd.)
2. Encoding
• Signal encoding improves area and power consumption.
• E.g., code compression [Wolfe and Chanin (Huffman)] and bus encoding.
• Data compression is more complex. E.g., Lempel-Ziv compression (L3 - MM).
• Bus-invert coding [Stan and Burleson].
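Bus-invert coding, cited above from Stan and Burleson, is simple enough to sketch (an 8-bit bus width is assumed for illustration):

```python
def bus_invert_encode(prev_bus, value, width=8):
    """Bus-invert coding [Stan and Burleson]: if sending `value`
    would toggle more than half the bus lines relative to what is
    currently on the bus (`prev_bus`), send the bitwise complement
    and assert an extra "invert" line instead, cutting switching
    activity and hence dynamic power."""
    mask = (1 << width) - 1
    transitions = bin((prev_bus ^ value) & mask).count("1")
    if transitions > width // 2:
        return (~value) & mask, 1   # inverted data, invert line = 1
    return value, 0                 # data as-is, invert line = 0

# Sending 0xFE after 0x01 would toggle all 8 lines; bus-invert
# transmits 0x01 with the invert line asserted instead.
print(bus_invert_encode(0x01, 0xFE))  # (1, 1)
```

The receiver simply re-inverts the data when the invert line is high, so the extra cost is one wire and an XOR stage at each end.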

CAD Challenges in MPSoCs (contd.)
3. Interconnect-driven design
• Early SoCs were driven by the design approach; interconnect choices were based on conventional bus concepts.
• A bus is a single set of wires shared among multiple devices. The best-known SoC buses are ARM AMBA and IBM CoreConnect.
• Growth in SoC complexity makes communication the bottleneck, motivating networks-on-chip (NoCs): hierarchical networks that use routers for data communication, i.e., multiple communication channels instead of a single shared bus.
• E.g., Sonics SiliconBackplane (a TDMA-style interconnection network).
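The TDMA-style arbitration mentioned for SiliconBackplane can be illustrated with a minimal sketch (the slot table and device names are invented for illustration, not taken from the Sonics design):

```python
def tdma_owner(slot_table, cycle):
    """TDMA-style interconnect arbitration: each time slot in a
    repeating table is pre-assigned to one initiator, so every
    device gets a guaranteed share of the bus bandwidth."""
    return slot_table[cycle % len(slot_table)]

# Hypothetical 4-slot schedule: the CPU gets half the slots, the
# video and audio accelerators a quarter each.
schedule = ["cpu", "video", "cpu", "audio"]
print([tdma_owner(schedule, c) for c in range(6)])
# ['cpu', 'video', 'cpu', 'audio', 'cpu', 'video']
```

Because ownership is decided by the table rather than by runtime contention, worst-case communication latency is bounded, which matters for the real-time requirements discussed earlier.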

CAD Challenges in MPSoCs (contd.)
4. Memory-system optimizations
• Cache: everything (placement, replacement, allocation, and write-back) is managed by hardware.
• Scratchpad: everything is managed by software.
• Servers and general-purpose systems use caches; a scratchpad provides predictability of hits and misses, which is important for ensuring real-time properties: the worst-case time is more tightly bounded, although software complexity increases with the application.
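The predictability argument can be made concrete with a toy model (a hypothetical 4-line direct-mapped cache, not any real MPSoC's memory system):

```python
class DirectMappedCache:
    """Toy direct-mapped cache: hardware decides hits and misses,
    so timing depends on the address pattern and is hard to bound."""
    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.tags = [None] * num_lines

    def access(self, addr):
        index, tag = addr % self.num_lines, addr // self.num_lines
        hit = self.tags[index] == tag
        self.tags[index] = tag          # hardware-managed replacement
        return "hit" if hit else "miss"

cache = DirectMappedCache()
# Two addresses that map to the same line evict each other: every
# access misses, even though only two locations are in use.
print([cache.access(a) for a in (0, 4, 0, 4)])

# Scratchpad analogy: software places data explicitly, so every
# subsequent access is a guaranteed on-chip hit.
scratchpad = {}
scratchpad[0] = "data@0"    # explicit software-managed placement
print(0 in scratchpad)      # True: fully predictable
```

The cache's behavior depends on address interference the programmer may not control, whereas the scratchpad's worst case is exactly what the software put there, which is why scratchpads suit hard real-time MPSoCs.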

CAD Challenges in MPSoCs (contd.)
5. Hardware/software codesign
• Used to explore the design space of heterogeneous multiprocessors.
• Cost estimation (area, power, and performance).
6. SDEs
• SDEs exist for single processors (commercial and open-source), but there is no comparable retargeting technology for multiprocessors.
• MPSoC development environments tend to be collections of tools with no substantial connection between them, making it difficult to determine the true state of the system.

Conclusion
• MPSoCs are an important chapter in the history of multiprocessing.
• System designers would prefer uniprocessors with sufficient computational power (e.g., DSPs for audio processing): the von Neumann architecture supports traditional software development tools.
• Computational power (Moore's law) must be weighed against low-power, low-cost, real-time requirements.