Gordon Bell Bay Area Research Center Microsoft Corporation

Options for embedded systems: constraints, challenges, and approaches. HPEC 2001, Lincoln Laboratory, 25 September 2001. Gordon Bell, Bay Area Research Center, Microsoft Corporation

More architecture options: Applications, COTS (clusters, computers… chips), Custom Chips…

The architecture challenge: “One person's system is another's component.” - Alan Perlis. Kurzweil predicted hardware will be compiled and be as easy to change as software by 2010. COTS: streaming, Beowulf, and www relevance? Architecture hierarchy: application; scalable components forming the system; design and test; chips: the raw materials. Scalability: fewest, replicatable components. Modularity: finding reusable components

The architecture levels & options The apps Data-types: “signals”, “packets”, video, voice, RF, etc. Environment: parallelism, power, power, power, speed, … cost The material: clock, transistors… Performance… it’s about parallelism Program & programming environment Network e.g. WWW and Grid Clusters Storage, cluster, and network interconnect Multiprocessors Processor and special processing Multi-threading and multiple processor per chip Instruction Level Parallelism vs Vector processors

Sony Playstation export limits: a problem the X-Box would like to have, … but has solved.

Will the PC prevail for the next decade as a/the dominant platform? … or be 2nd to smart, mobile devices? Moore's Law increases performance; Bell's Corollary reduces prices for new classes. PC server clusters aka Beowulf with low-cost OS kill proprietary switches, smPs, and DSMs. Home entertainment & control … Very large disks (1 TB by 2005) to “store everything”; screens to enhance use. Mobile devices, etc. dominate WWW >2003! Voice and video become the important apps! C = Commercial; C' = Consumer

Where's the action? Problems? Constraints from the application: speech, video, mobility, RF, GPS, security… Moore's Law, networking, interconnects. Scalability and high-performance processing. Building them: clusters vs DSM. Structure: where's the processing, memory, and switches (disk and tcp/ip processing)? Micros: getting the most from the nodes. Not ISAs: change can delay the Moore's Law effect … and wipe out software investment! Please, please, just interpret my object code! System (on a chip) alternatives… apps drivers. Data-types (e.g. voice, video, RF), performance, portability/power, and cost

COTS: Anything at the system structure level to use? How are the system components e.g. computers, etc. going to be interconnected? What are the components? Linux What is the programming model? Is a plane, CCC, tank, fleet, ship, etc. an Internet? Beowulfs… the next COTS What happened to Ada? Visual Basic? Java?

Computing SNAP built entirely from PCs: legacy mainframe & minicomputer servers & terminals; portables; wide-area global network; mobile nets; wide & local area networks for terminals, PCs, workstations, & servers; person servers (PCs); scalable computers built from PCs; centralized & departmental uni- & mP servers (UNIX & NT) built from PCs. Here's a much more radical scenario, but one that seems very likely to me. There will be very little difference between servers and the person servers, or what we mostly associate with clients. This will come because economy of scale is replaced by economy of volume. The largest computer is no longer cost-effective. Scalable computing technology dictates using the highest-volume, most cost-effective nodes. This means we build everything, including mainframes and multiprocessor servers, from PCs! TC = TV + PC home ... (CATV or ATM or satellite). A space, time (bandwidth), & generation scalable environment

How Will Future Computers Be Built? Thesis: SNAP: Scalable Networks and Platforms Upsize from desktop to world-scale computer based on a few standard components Because: Moore’s law: exponential progress Standardization & Commoditization Stratification and competition When: Sooner than you think! Massive standardization gives massive use Economic forces are enormous

Five Scalabilities. Size scalable: designed from a few components, with no bottlenecks. Generation scaling: no rewrite/recompile or user effort to run across generations of an architecture. Reliability scaling: choose any level. Geographic scaling: compute anywhere (e.g. multiple sites or in situ workstation sites). Problem x machine scalability: the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer. Problem x machine space => run time: problem scale, machine scale (#p), run time, implying speedup and efficiency. Unclear whether scalability has any meaning for a real system. While the following are all based on some order of N, the engineering details of a system determine the missing constants! A system is scalable if efficiency(n, x) = 1 for all algorithms, numbers of processors n, and problem sizes x. This fails to recognize cost, efficiency, and whether VLSCs are practical (affordable) in a reasonable time scale. Cost < O(N^2) rules out the cross-point = O(N^2), though its latency is O(1); Omega is O(N log N), Ring/Bus/Mesh O(N). Bandwidth is required to be < O(log N). Supercomputer bandwidths are O(N)... no caching, hierarchies. SIMD didn't scale; the CM5 probably won't. Compatibility with the future is important. No matter how much you build on standards, you want the next one to take all the programs (without recompilation) and files, and run them with no changes!
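The slide's scalability criterion, efficiency(n, x) = 1 for every processor count and problem size, can be made concrete with a toy Amdahl-style model. This is a hedged sketch, not from the deck: the serial fraction parameter and both helper functions are assumptions for illustration.

```python
# Toy model of the slide's criterion efficiency(n, x) = 1.
# The serial_fraction parameter is an assumed illustration, not from the deck.
def speedup(n, serial_fraction):
    """Amdahl's-law speedup on n processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def efficiency(n, serial_fraction):
    """Efficiency = speedup / n; a perfectly scalable system keeps this at 1."""
    return speedup(n, serial_fraction) / n

# Even a 1% serial fraction erodes efficiency badly at scale:
for n in (1, 10, 100, 1000):
    print(n, round(efficiency(n, 0.01), 3))
```

This illustrates why the slide calls the pure efficiency definition inadequate: real systems never hold efficiency at 1, and the engineering constants dominate.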

Why I gave up on large smPs & DSMs Economics: Perf/Cost is lower…unless a commodity Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance. Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system. DSMs … NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway. Aren’t scalable. Reliability requires clusters. Start there. They aren’t needed for most apps… hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.

What is the basic structure of these scalable systems? Overall disk connection, especially wrt Fibre Channel SAN, and especially with fast WANs & LANs

GB plumbing from the baroque: evolving from 2 dance-hall SMP & Storage model Mp — S — Pc : | : |—————— S.fc — Ms | : |— S.Cluster |— S.WAN — vs. MpPcMs — S.Lan/Cluster/Wan — :

SNAP Architecture---------- With this introduction about technology, computing styles, and the chaos and hype around standards and openness, we can look at the Network & Nodes architecture I posit.

ISTORE Hardware Vision. System-on-a-chip enables computer + memory without significantly increasing the size of the disk. 5-7 year target: MicroDrive, 1.7” x 1.4” x 0.2”. 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek. 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW). Integrated IRAM processor, 2x height, connected via crossbar switch growing like Moore's law: 16 Mbytes; 1.6 Gflops; 6.4 Gops. 10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops
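The 2006 figures follow from compounding the slide's stated growth rates over the seven years from the 1999 baseline. A minimal sketch checking that arithmetic (the `project` helper is mine, not from the slide):

```python
# Compound the slide's growth rates: 340 MB and 5 MB/s in 1999,
# 1.6x/yr capacity and 1.4x/yr bandwidth, projected 7 years to 2006.
def project(value, annual_rate, years):
    return value * annual_rate ** years

capacity_2006_mb = project(340, 1.6, 7)   # ~9100 MB, i.e. the slide's ~9 GB
bandwidth_2006 = project(5, 1.4, 7)       # ~53 MB/s, i.e. the slide's ~50 MB/s
```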

The Disk Farm? or a System On a Card? 14" The 500GB disc card An array of discs Can be used as 100 discs 1 striped disc 50 FT discs ....etc LOTS of accesses/second of bandwidth A few disks are replaced by 10s of Gbytes of RAM and a processor to run Apps!!

The Promise of SAN/VIA/Infiniband http://www.ViArch.org/ Yesterday: 10 MBps (100 Mbps Ethernet); ~20 MBps tcp/ip saturates 2 cpus; round-trip latency ~250 µs. Now: wires are 10x faster (Myrinet, Gbps Ethernet, ServerNet,…); fast user-level communication; tcp/ip ~100 MBps at 10% cpu; round-trip latency is ~15 µs; 1.6 Gbps demoed on a WAN
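One way to read the two generations of numbers is through the bandwidth-delay product: how much data is in flight during one round trip. A hedged sketch using only the slide's figures (the helper function is mine):

```python
# Bandwidth-delay product for the slide's two interconnect generations.
def bytes_in_flight(bandwidth_mb_per_s, rtt_us):
    """Bandwidth in MB/s, round-trip latency in microseconds; returns bytes."""
    return bandwidth_mb_per_s * 1e6 * rtt_us * 1e-6

yesterday = bytes_in_flight(10, 250)   # 10 MB/s, 250 us RTT -> 2500 bytes
now = bytes_in_flight(100, 15)         # 100 MB/s, 15 us RTT -> 1500 bytes
```

Despite 10x the bandwidth, less data is outstanding per round trip, because latency fell even faster: the point of user-level communication.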

Top500 taxonomy… everything is a cluster aka multicomputer. Clusters are the ONLY scalable structure. Cluster: n inter-connected computer nodes operating as one system. Nodes: uni- or SMP. Processor types: scalar or vector. MPP = miscellaneous, not massive (>1000), SIMD or something we couldn't name. Cluster types (implied message passing): Constellations = clusters of >=16 P, SMP; commodity clusters of uni- or <=4 Ps, SMP; DSM: NUMA (and COMA) SMPs and constellations; DMA clusters (direct memory access) vs msg. pass; uni- and SMP vector clusters: vector clusters and vector constellations

Courtesy of Dr. Thomas Sterling, Caltech

The Virtuous Economic Cycle drives the PC industry… & Beowulf. Cycle diagram labels: volume; standards; competition; DOJ; utility/value; innovation; creates apps, tools, training; attracts users; attracts suppliers; greater availability @ lower cost

BEOWULF-CLASS SYSTEMS Cluster of PCs Intel x86 DEC Alpha Mac Power PC Pure M2COTS Unix-like O/S with source Linux, BSD, Solaris Message passing programming model PVM, MPI, BSP, homebrew remedies Single user environments Large science and engineering applications

Lessons from Beowulf. An experiment in parallel computing systems. Established vision: low-cost, high-end computing. Demonstrated effectiveness of PC clusters for some (not all) classes of applications. Provided networking software. Provided cluster management tools. Conveyed findings to broad community via tutorials and the book. Provided design standard to rally community! Standards beget books, trained people, software … a virtuous cycle that allowed apps to form. Industry begins to form beyond a research project. Courtesy, Thomas Sterling, Caltech.

Designs at chip level… any COTS options? Substantially more programmability versus factory compilation As systems move onto chips and chip sets become part of larger systems, Electronic Design must move from RTL to algorithms. Verification and design of “GigaScale systems” will be the challenge.

The Productivity Gap (chart, source: SEMATECH). Logic transistors per chip grow at a 58%/yr compound complexity growth rate, while productivity (transistors per staff-month) grows at only a 21%/yr compound rate; the chart plots both trend lines, 1981-2009, on log scales.
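The gap compounds: dividing the two growth rates gives how much faster complexity outruns designer productivity each year. A hedged sketch of that arithmetic (the helper and the 10-year horizon are mine):

```python
# The SEMATECH gap: complexity compounds at 58%/yr, productivity at 21%/yr,
# so required design effort (staff-months per chip) grows at their ratio.
def gap_growth(complexity_rate=0.58, productivity_rate=0.21):
    return (1 + complexity_rate) / (1 + productivity_rate) - 1

annual_gap = gap_growth()                      # ~31%/yr effort growth
effort_multiplier = (1 + annual_gap) ** 10     # ~14x more effort in a decade
```

This is the economic argument behind the deck's push toward reuse and programmable platforms: a 14x effort increase per decade is unsustainable with hand design.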

What Is GigaScale? Extremely large gate counts. High complexity: chips & chip sets, systems & multiple systems; complex data manipulation; complex dataflow. Intense pressure to be correct the first time; TTM, cost of failure, etc. impact the ability to have a silicon startup. Multiple languages and abstraction levels: design, verification, and software

EDA Evolution: chips to systems. 1975, IC Designer (Calma & CV): physical design. 1985, Chip Architect (Daisy, Mentor): gates, 10K gates, simulation. 1995, System Architect (Synopsys & Cadence): RTL, 1M gates, testbench automation, emulation, formal verification, plus ASIC Designer. 2005, GigaScale Architect (e.g. Forte): hierarchical verification, plus SOC Designer. Courtesy of Forte Design Systems

Processor Limit: the DRAM Gap (chart: performance vs. time, 1980-2000). CPU performance (“Moore's Law”) grows 60%/yr while DRAM improves 7%/yr, so the processor-memory performance gap grows about 50%/yr. Latency cliché: note that x86 didn't have cache on chip until 1989. Alpha 21264 full cache miss per instructions executed: 180 ns / 1.7 ns ≈ 108 clks, x 4-way issue, or 432 instructions. Caches in the Pentium Pro: 64% of area, 88% of transistors. *Taken from a Patterson-Keeton talk to SIGMOD
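The 21264 arithmetic on the slide is worth spelling out: a full miss costs the miss latency divided by the cycle time, times the issue width, in lost instruction slots. A hedged sketch, where the cycle time is chosen to reproduce the slide's 108-clock figure (roughly a 600 MHz clock):

```python
# Cost of a full cache miss in lost instruction-issue slots,
# using the slide's Alpha 21264 numbers (cycle time picked to
# match the slide's 108-clock figure, i.e. ~1.67 ns / ~600 MHz).
def lost_slots(miss_ns, cycle_ns, issue_width):
    cycles = miss_ns / cycle_ns
    return cycles, cycles * issue_width

cycles, slots = lost_slots(180, 180 / 108, 4)  # -> 108 clks, 432 slots
```

This is the quantitative case for the next slide's remedies: more processors or threads per chip to hide the inevitable access delays.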

The “memory gap”. Multiple (e.g. 4) processors/chip, in order to increase the ops/chip while waiting for the inevitable access delays. Or alternatively, multi-threading (MTA). Vector processors with a supporting memory system. System-on-a-chip… to reduce chip boundary crossings

If system-on-a-chip is the answer, what is the problem? Small, high volume products Phones, PDAs, Toys & games (to sell batteries) Cars Home appliances TV & video Communication infrastructure Plain old computers… and portables Embeddable computers of all types where performance and/or power are the major constraints.

SOC Alternatives… not including C/C++ CAD Tools The blank sheet of paper: FPGA Auto design of a processor: Tensilica Standardized, committee designed components*, cells, and custom IP Standard components including more application specific processors *, IP add-ons plus custom One chip does it all: SMOP *Processors, Memory, Communication & Memory Links,

Tradeoffs and Reuse Model (interface diagram: IUnknown, IOleObject, IDataObject, IPersistentStorage, IOleDocument, IFoo, IBar, IPGood, IOleBad; system, application, application implementation). Across the spectrum Structured Custom / RTL Flow → FPGA → FPGA & GPP → ASIP → DSP: cost to develop/iterate a new application goes from high to lower; MOPS/mW from high to low; time to develop/iterate a new application from high to lower; programmability from low to high. Levels: architecture, microarchitecture, platform exportation, silicon process

System-on-a-chip alternatives. FPGA: sea of un-committed gate arrays (Xilinx, Altera). Compile a system: a unique processor for every app (Tensilica). Systolic | array: many pipelined or parallel processors + custom Pc + ??; dynamic reconfiguration of the entire chip… Pc + DSP | VLIW: special-purpose processor cores + custom (TI Pc & Mp). ASICs: general-purpose cores, specialized by I/O, etc. (IBM, Intel, Lucent). Universal Micro: multiprocessor array, programmable I/O (Cradle, Intel IXP 1200)

Xilinx: 10M gates, 500M transistors, 0.12 micron

Tensilica Approach: Compiled Processor Plus Development Tools. ALU, I/O, timer, pipe, cache, register file, MMU: a tailored, HDL uP core. Using the processor generator, create... Describe the processor attributes from a browser-like interface. Some of the possibilities here are hinted at by advanced commercial activities. In mid-February, Tensilica announced their product whereby a designer can fill out a web page to specify the macro characteristics of an embedded processor. The user then executes a script that not only automatically synthesizes the layout of the desired processor, but also automatically creates a compiler, debugger, linker, and instruction set simulator tailored for the particular processor “designed” by the user. By controlling such aspects as cache size, data widths, and addressing modes, and with an ability to include “customer tooled” instructions into the processor, the user can evaluate an application on this cost-, performance-, and power-optimized processor before requesting it to be fabricated. This technology is in its infancy today, but represents an important example of the way an integrated approach to hardware and software development can extend the capabilities of the technology for embedded systems. Today, the compiler cannot really take advantage of the new instructions (that is still up to the user) and the system says nothing about the RTOS or other runtime support, other than via a port of standard operating systems. (This slide is included with permission of Chris Rowen, CEO of Tensilica.) Customized compiler, assembler, linker, debugger, simulator. Standard cell library targeted to the silicon process. Courtesy of Tensilica, Inc. http://www.tensilica.com Richard Newton, UC/Berkeley

EEMBC Networking Benchmark. Benchmarks: OSPF, Route Lookup, Packet Flow. Xtensa with no optimization is comparable to 64b RISCs; Xtensa with optimization is comparable to high-end desktop CPUs. Xtensa has outstanding efficiency (performance per cycle, per watt, per mm²). Xtensa optimizations: custom instructions for route lookup and packet flow. Colors: blue - Xtensa, green - desktop x86s, maroon - 64b RISCs, orange - 32b RISCs

EEMBC Consumer Benchmark. Benchmarks: JPEG, grey-scale filter, color-space conversion. Xtensa with no optimization is comparable to 64b RISCs; Xtensa with optimization beats all processors by 6x (no JPEG optimization). Xtensa has exceptional efficiency (performance per cycle, per watt, per mm²). Xtensa optimizations: custom instructions for filters, RGB-YIQ, RGB-CMYK. Colors: blue - Xtensa, green - desktop x86s, maroon - 64b RISCs, orange - 32b RISCs

Free 32 bit processor core

Complex SOC architecture Synopsys via Richard Newton, UC/B

UMS Architecture. Memory bandwidth scales with processing. Must allow mix and match of applications. Design reuse is important, thus scalability is a must. Resources must be balanced. Cradle is developing such an architecture, which has multiple processors (MSPs) attached to private memories that can communicate with external devices through a DRAM controller and programmable I/O. Explain architecture: regular, modular, processing with memory, high-speed bus. Memory bandwidth scales with processing. Scalable processing, software, I/O. Each app runs on its own pool of processors. Enables durable, portable intellectual property

Cradle UMS Design Goals Minimize design time for applications Efficient programming model High reusability accelerates derivative development Cost/Performance Replace ASICs, FPGAs, ASSPs, and DSPs Low power for battery powered appliances Flexibility Cost effective solution to address fragmenting markets Faster return on R&D investments

Universal Microsystem (UMS). Diagram: Quad 1 … Quad “n”, I/O Quads, global bus, SDRAM control, PLA ring. The Universal Micro System (UMS) provides a revolutionary approach to developing feature-rich electronic systems in a cost-effective manner. The UMS family of off-the-shelf silicon chips offers high-performance computation and completely programmable high-speed interfaces, shifting the burden of system design from hardware- to software-based methodologies. The benefits of this shift include decreased time to market, significantly reduced development cost, and high reuse, which allows rapid and cost-effective creation of derivative products. Each Quad has 4 RISCs, 8 DSPs, and memory. A unique I/O subsystem keeps interfaces soft.

The Universal Micro System (UMS): an off-the-shelf “platform” for product-line solutions. Superior digital signal processing (single-clock FP-MAC); local memory that scales with additional processors. The UMS is a multiprocessor system-on-a-chip (SoC) with programmable I/O for interfacing to external devices. It consists of multiple processors hierarchically connected by two levels of buses. At the lower level, a cluster of processors called a Quad is connected by a local bus and shares local memory. Several Quads are connected by a very high-speed global bus. Quads communicate with external DRAM and I/O interfaces through a DRAM controller and a fully programmable I/O system, respectively. The UMS is a shared-memory MIMD (multiple instruction/multiple data) computer that uses a single 32-bit address space for all register and memory elements. Each register and memory element in the UMS has a unique address and is uniquely addressable. Quad: Each Quad consists of four RISC-like processors called Processing Elements (PEs), eight DSP-like processors called Digital Signal Engines (DSEs), and one Memory Transfer Engine (MTE) with four Memory Transfer Controllers (MTCs). The MTCs are essentially DMA engines for background data movement. In addition, there is 32 KB of instruction cache shared by the PEs and 64 KB of data memory, 32 KB of which can optionally be configured as cache, shared by the Quads. The synchronization mechanism between processors is provided by 32 semaphore registers within each Quad. Note that the Media Stream Processor (MSP) is a logical unit consisting of one PE and two DSEs. Processing Element: The PE is a 32-bit processor with 16-bit instructions and 32 32-bit registers. The PE has a RISC-like instruction set consisting of both integer and IEEE 754 floating-point instructions. The instructions have a variety of addressing modes for efficient use of memory. The PE is rated at approximately 90 MIPs.
The PE uses the Quad program cache and Quad data memory in the RISC manner to give Harvard-architecture (separate instruction and data memories) performance using a von Neumann (single instruction and data memory) programming model. The PEs are also ideally suited for control functions. They can set up data for processing by initiating and monitoring MTC data transfers. The MTCs transfer data between the local data memory and the SDRAM in the background. The PEs can also initiate and monitor DSE processing and coordinate between MTCs and DSEs. Digital Signal Engine: The DSE is a 32-bit processor with 128 registers and a local program memory of 512 20-bit instructions, optimized for high-speed fixed- and floating-point processing. It uses MTCs to automatically transfer data between the DSE and local or DRAM memory in the background. The DSE is the primary compute engine of the UMS and is rated at approximately 350 MIPs for integer or floating-point performance. The DSE has 32-bit fixed- and floating-point instructions and four floating-point multiplier-accumulator (MAC) units that can perform three floating-point operations (float, floating multiply, and floating accumulate) in each clock cycle. The MAC unit has four accumulator registers to hold partial results. The 350 MIPs of floating-point instructions translates into 1.05 gigaflops. It also has read and write data FIFOs that allow data pre-fetch from memory and post-write to memory while executing instructions. This FIFO memory interface allows the DSE to maintain a high processing rate with efficient memory transfers. Caches and Memories: Each Quad has 32 KB of instruction cache and 64 KB of data memory and cache. A feature of the data memory is that it can be partitioned into a local data memory or data cache (up to 32 KB). Both caches are write-through caches with no allocate-on-write. The instruction and data caches are 128-way associative.
The PEs share the instruction cache as the source of their instruction streams. Each PE executes 16-bit instructions at an average of four clocks per instruction. Each instruction fetch from cache yields 64 bits, or four instructions. With all four PEs executing instructions, the average instruction cache utilization is between 25 and 35 percent, depending on the code length between jumps. In addition, since the PEs share the same instruction and data cache, a mechanism to isolate instruction and data for each PE is provided. Communication: Each Quad has two 64-bit local buses: an instruction bus and a data bus. The instruction bus connects the PEs and MTE to the instruction cache. The data bus connects the PEs, DSEs, and MTE to the local data memory. Both buses consist of a 32-bit address bus, a 64-bit write data bus, and a 64-bit read data bus. This corresponds to a sustained bandwidth of more than 2.8 Gbytes/s per bus. The MTE is a multithreaded DMA engine with four MTCs. Each MTC moves a block of data from a source address to a destination address. The MTE is a modified version of the DSE with four program counters (instead of one) as well as 128 registers and 2K of instruction memory. The MTCs also have special functional units for BitBLT, Reed-Solomon, and CRC operations. Synchronization: Each Quad has 32 globally accessible semaphore registers that are allocated either statically or dynamically. The semaphore registers associated with a PE, when set, can also generate interrupts to the PE. Scalable real-time functions in software using small, fast processors (QUAD). Intelligent I/O subsystem (change interfaces without changing chips). 250 MFLOPS/mm²
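Two of the figures in these notes follow directly from the stated parameters, and can be sketched as arithmetic (the variable names are mine; the numbers are the notes' own):

```python
# Check the UMS notes' arithmetic:
# 1) a DSE at ~350 MIPs with a MAC doing 3 FP ops/cycle gives ~1.05 Gflops;
# 2) a 64-bit bus sustaining 2.8 GB/s implies a ~350 MHz transfer rate.
dse_mips = 350e6
dse_flops = dse_mips * 3              # 3 FP ops per clock -> 1.05e9 flops

bus_bytes_per_s = 2.8e9
bus_rate_hz = bus_bytes_per_s / 8     # 64 bits = 8 bytes per transfer
```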

VPN Enterprise Gateway: five quads; two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface; handles 250 end users and 100 routes; does key handling for IPSec; delivers 100 Mbps of 3DES. Firewall / IP telephony (O/S for user interactions): single quad; two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface; handles 250 end users and 100 routes; does key handling for IPSec; delivers 50 Mbps of 3DES

UMS Application Performance (Table 2: performance of kernels on UMS; an MSP is a logical unit of one PE and two DSEs):
MPEG video decode: 4 MSPs (720x480, 9 Mbits/s); 6 MSPs (720x480, 15 Mbits/s)
MPEG video encode: 10-16 MSPs (32^2/128^2 search area)
AC3 audio decode: 1 MSP
Modems: 0.5 MSP (V90); 3 MSPs (G.Lite ADSL)
Ethernet router (level 3 + QOS): per 100 Mb channel; per Gigabit channel
Encryption: 3DES 15 Mb/s; MD5 425 Mb/s
3D geometry, lighting, rendering: 1.6M polygons/s
DV encode/decode: 8 MSPs (camcorder)
The architecture permits scalable software. Supports two Gigabit Ethernets at wire speed; four fast Ethernets; four T-1s, USB, PCI, 1394, etc.

Cradle: Universal Microsystem, trading Verilog & hardware for C/C++. UMS : VLSI = microprocessor : special systems; software : hardware. A single part for all apps; the app is specified at run time using FPGA & ROM. 5 quad mPs at 3 Gflops/quad = 15 Gflops. Single shared memory space, caches. Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100baseT, FireWire. $4 per flops; 150 mW/Gflops

Silicon Landscape 200x
Increasing cost of fabrication and masks
  $7M for a high-end ASSP chip design
  Over $650K for masks alone, and rising
  SOC/ASIC companies require a $7-10M business guarantee
Physical effects (parasitics, reliability issues, power management) are more significant design issues
  These must now be considered explicitly at the circuit level
Design complexity and "context complexity" are high enough that design verification is a major limitation on time-to-market
Fewer design starts, higher design volume... implies more programmable platforms

The cost of a state-of-the-art fabrication facility continues to rise; a new 0.18µm high-volume manufacturing plant is estimated to cost approximately $2-3B today. This cost is driving continued consolidation of the back-end manufacturing process. It is also prejudicing manufacturers toward parts with guaranteed high-volume production from a single mask set (or that are likely to reach high volume, if successful), which translates to better response time and higher priority when global manufacturing resources are in short supply. In addition, the NRE costs associated with the design and tooling of complex chips are growing rapidly, and the ITRS predicts that while manufacturing complex SOC designs will be practical at least down to 50nm minimum feature sizes, the production of practical masks and exposure systems will likely be a major bottleneck for the development of such chips. A single mask set and probe card for a state-of-the-art chip costs over $700K for a complex part today, up from less than $100K a decade ago (note: this does not include the design cost; a complex custom "startup" ASSP chip design today is estimated to cost at least $7M, including the masks).
In addition, the cost of developing and implementing a comprehensive test for such complex designs will represent an increasing fraction of total design cost unless new approaches are developed. Design validation is still the limiting factor in time-to-market for complex embedded systems, as well as in the predictability of time-to-market (often even more important). This is due to the increasingly significant effects of physics (e.g., on-chip communication and reliable power distribution: "Do I actually get what I designed?") as well as the impact of increasing design complexity ("Do I actually get what I want?"). In addition, the specification of the actual design requirement is often incomplete or evolves over time. The designer's ability to correct and turn a design quickly is increasingly important, in the commercial sector as well as in complex military applications. These factors will increase the value of pre-characterized, optimized, and pre-verified microarchitectural families, as well as the preference for a programmable approach, if energy, delay, cost, and reliability objectives can be met by a programmable solution. We believe these factors are likely to lead to the creation of parameterized "standard programmable platforms" for the implementation of embedded systems, rather than the "unique assembly of components" approach for each new design, which is the current stated goal of industry (especially elements of the EDA industry). Richard Newton, UC/Berkeley

The End

[Diagram: three design styles, each mapping application(s) down to a physical implementation]
General-Purpose Computing: Application(s) -> Instruction Set Architecture (360, SPARC, 3000, ...) -> "Physical Implementation"
Platform-Based Design: Application(s) -> Microarchitecture & Software -> Platform -> Physical Implementation
ASIC/FPGA Flow: Application(s) -> Verilog, VHDL, ... -> Synthesizable RTL -> ASIC, FPGA

The Energy-Flexibility Gap
[Chart: energy efficiency in MOPS/mW (or MIPS/mW), log scale 0.1-1000, vs. flexibility (coverage)]
  Dedicated HW (MUD): 100-200 MOPS/mW
  Reconfigurable Processor/Logic (Pleiades): 10-50 MOPS/mW
  ASIPs, DSPs (V DSP): 3 MOPS/mW
  Embedded µProcessors (LPArm): 0.5-2 MIPS/mW
Source: Prof. Jan Rabaey, UC Berkeley
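The chart's spread of roughly four decades can be summarized numerically, using the figures read off the chart above:

```python
# Energy-efficiency points from the chart, as (low, high) ranges in MOPS/mW
# (MIPS/mW for the embedded processor row).
points = {
    "Dedicated HW (MUD)":          (100, 200),
    "Reconfigurable (Pleiades)":   (10, 50),
    "DSP (V DSP)":                 (3, 3),
    "Embedded uP (LPArm)":         (0.5, 2),
}
best = max(hi for _, hi in points.values())
worst = min(lo for lo, _ in points.values())
print(f"dedicated HW vs. embedded uP: up to {best / worst:.0f}x")  # 400x
```

That factor of a few hundred is the "gap" a programmable platform must narrow to ~2-4x for it to displace dedicated silicon.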

Approaches to Reuse
SOC as the Assembly of Components? (Alberto Sangiovanni-Vincentelli)
SOC as a Programmable Platform? (Kurt Keutzer)

Component-Based Programmable Platform Approach
Application-Specific Programmable Platforms (ASPP)
  These platforms will be highly programmable
  They will implement highly concurrent functionality
Assemble components from a parameterized library
Integrate using a programmable approach to on-chip communication
Intermediate language that exposes the programmability of all aspects of the microarchitecture: the "assembly language" for the processor

We believe that most components will likely be predesigned all the way to parameterized layout. They will interact via an on-chip communication fabric, and a big opportunity here is the notion of programmable (or adaptive) on-chip communication. As we have seen from the ACS program, among other examples, the value of customizing the configuration of processors (whether by using programmable logic elements, or fixed but optimizable communication mechanisms of other forms) is where a huge amount of the remaining power and performance improvement in these microarchitectures is to be had. To include 3rd-party (possibly military) special-purpose IP, it is essential that we create open, flexible, and efficient on-chip communication "standards." The right way to do this, as with RISC, is to develop them with a need in mind via an integrated research program and then let them flow to industry via students, start-ups, etc. Richard Newton, UC/Berkeley

Courtesy of Tensilica, Inc. http://www.tensilica.com
Compact Synthesized Processor, Including Software Development Environment
Use virtually any standard-cell library with commercial memory generators
Base implementation is less than 25K gates (~1.0 mm² in 0.25µm CMOS), shown to scale on a typical $10 IC (3-6% of 60 mm²)
Power dissipation in 0.25µm standard cell is less than 0.5 mW/MHz

In the case of the Tensilica product, the processor is automatically synthesized in a standard-cell library. (This slide is included with permission of Chris Rowen, CEO of Tensilica.)

Challenges of Programmability for Consumer Applications
Power, Power, Power...
Performance, Performance, Performance...
Cost

Can we develop approaches to programming silicon and its integration, along with the tools and methodologies to support them, that will allow us to approach the power and performance of a dedicated solution sufficiently closely (~2-4x?) that a programmable platform is the preferred choice? Richard Newton, UC/Berkeley

Bottom Line: Programmable Platforms
The challenge is finding the right programmer's model and associated family of micro-architectures
  Address a wide-enough range of applications efficiently (performance, power, etc.)
Successful platform developers must "own" the software development environment and the associated kernel-level run-time environment
"It's all about concurrency"
If you could develop a very efficient and reliable re-programmable logic technology (comparable to ASIC densities), you would eventually own the silicon industry! Richard Newton, UC/Berkeley

Approaches to Reuse
SOC as the Assembly of Components? (Alberto Sangiovanni-Vincentelli)
SOC as a Programmable Platform? (Kurt Keutzer)
Richard Newton, UC/Berkeley

A Component-Based Approach…
Simple Universal Protocol (SUP)
  Unix pipes (character streams only)
  TCP/IP (only one type of packet; limited options)
  RS232, PCI
  Streaming…
Single-Owner Protocol (SOP)
  Visual Basic
  Unibus, Massbus, Sbus, …
Simple Interfaces, Complex Application (SIC)
  When "the spec is much simpler than the code"* you aren't tempted to rewrite it
  SQL, SAP, etc.

Implies "natural" boundaries along which to partition IP; successful components will be aligned with those boundaries. (*Suggested by Butler Lampson)
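The Unix-pipe flavor of a Simple Universal Protocol can be sketched in a few lines: each stage agrees only on "a stream in, a stream out," so stages written by different owners compose without knowing anything about each other. Illustrative Python, not from the slide:

```python
# Two independently written "components" that speak only the stream protocol.
def upper(stream):
    """Uppercase every item in the stream."""
    for line in stream:
        yield line.upper()

def number(stream):
    """Prefix each item with a running line number."""
    for i, line in enumerate(stream, 1):
        yield f"{i}: {line}"

# Composition is just function chaining, like `cat | tr | nl` in a shell.
source = ["hello", "world"]
print(list(number(upper(source))))   # ['1: HELLO', '2: WORLD']
```

The same property is what makes TCP/IP's single packet type or RS232's byte stream such durable partition boundaries for IP.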

The Key Elements of the SOC
Applications
Distributed OS (Network)
Software Development
What is the platform, aka the programmer's model?
Microarchitecture
Design Technology
RF, MEMS, optical, ASIP
Richard Newton, UC/Berkeley

Power as the Driver (Power is still, almost always, the driver!)
[Chart: four orders of magnitude in power across process generations: 1µm, 0.35µm, 0.25µm]
Source: R. Brodersen, UC Berkeley

Back end

Computer ops/sec x word length / $

Microprocessor performance
[Chart: log scale from Kilo to 100 Giga, 1970-2010]
Peak Advertised Performance (PAP): Moore's Law
Real Applied Performance (RAP): 41% growth
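The PAP/RAP divergence compounds quickly. A sketch, assuming roughly 60%/year Moore's-law growth for PAP (an assumed figure) against the chart's 41%/year for RAP:

```python
import math

# Annual growth factors: 60%/yr for PAP (assumed Moore's-law rate),
# 41%/yr for RAP (from the chart).
pap, rap = 1.60, 1.41

double_pap = math.log(2) / math.log(pap)   # ~1.5 years to double
double_rap = math.log(2) / math.log(rap)   # ~2.0 years to double
gap_10yr = (pap / rap) ** 10               # PAP/RAP ratio after a decade

print(f"doubling: {double_pap:.1f} vs {double_rap:.1f} years; "
      f"10-year PAP/RAP gap: {gap_10yr:.1f}x")
```

Even a modest difference in annual growth opens a several-fold gap between advertised and delivered performance within a decade.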

GigaScale Evolution
In 1999, fewer than 3% of engineers were doing designs with more than 10M transistors per chip. (Dataquest)
By early 2002, 0.1 micron will allow 600M transistors per chip. (Dataquest)
In 2001, 49% of engineers were @ 0.18 micron, 5% @ 0.10 micron. (EE Times)
54% plan to be @ 0.10 micron in 2003. (EE Times)

Challenges of GigaScale
GigaScale systems are too big to simulate
  Hierarchical verification
  Distributed verification
Requires a higher level of abstraction
  Higher abstraction needed for verification
    High-level modeling
    Transaction-based verification
  Higher abstraction needed for design
    High-level synthesis required for a productivity breakthrough
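Transaction-based verification can be illustrated in miniature: the testbench exercises read/write transactions against a behavioral model rather than toggling per-clock signals, so one check stands in for hundreds of cycle-level events. All names here are invented for illustration, not Cradle's actual interfaces:

```python
# A transaction-level model of a 64-bit data bus with a 32-bit address
# space (widths borrowed from the Quad bus description earlier in the deck).
class BusModel:
    def __init__(self):
        self.mem = {}

    def write(self, addr, data):
        # One "transaction" replaces many cycles of address/strobe/data signals.
        assert 0 <= addr < 2**32, "32-bit address bus"
        assert 0 <= data < 2**64, "64-bit data bus"
        self.mem[addr] = data

    def read(self, addr):
        return self.mem.get(addr, 0)

bus = BusModel()
bus.write(0x1000, 0xDEADBEEF)
assert bus.read(0x1000) == 0xDEADBEEF   # one assertion at transaction level
print("transaction test passed")
```

The same tests can later be replayed against the RTL through a bus-functional adapter, which is what makes the abstraction jump pay off for GigaScale designs.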