© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.

Presentation transcript:


© 2010 Altera Corporation - Public ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. and Altera marks in and outside the U.S.

2 Introduction
Lutiac is an experimental soft processor
Designed for very small programs
  - roughly 200 instructions
  - roughly 200 words of data
Take a drastic step to reduce the size of the processor
Measure its area and speed
Compare to NIOS II

3 Typical Microprocessor
[Block diagram: PC with +1 incrementer, Instruction Memory, Decoder driving the control points; data path with A registers, B registers and ALU; input from the outside world, output to the outside world]

4 Typical Microprocessor
Typical microprocessor consists of:
  - data path (registers, ALU, ...)
  - controller (PC, instruction memory, decoder)
Data path has control inputs
  - register file read addresses
  - register file write address
  - register file write enable
  - instruction is add/subtract/and/or/copy/...

5 Control Inputs
Control inputs are driven from the decoder
Decoder driven from current instruction
Current instruction determined by program counter
If instruction memory never changes:
  - current instruction is a constant function of the program counter
  - so control inputs depend entirely on the value of the program counter

6 Control Inputs Are Function of PC
If we have small programs (≤ 64 total instructions)
  - program counter only needs 6 bits
Each control input is a function of 6 PC bits
  - could be replaced by a 6-LUT
Entire decoder is a set of 6-LUTs
Instruction memory isn't needed at all, and can be removed
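
The point of this slide can be illustrated with a small Python sketch: for a fixed program of at most 64 instructions, each control signal is a 64-entry truth table indexed by the 6-bit PC, which is exactly what one 6-input LUT stores. The program contents, the opcode names and the signal names (`IS_ADD`, `WRITE_ENABLE`) are hypothetical, chosen only for illustration; they are not taken from the Lutiac prototype.

```python
# Toy program: hypothetical opcodes, one per PC value.
PROGRAM = ["add", "sub", "copy", "add", "nop", "and"]

def lut_mask(program, predicate, pc_bits=6):
    """Build the truth table (as an integer bit mask) of one control signal.

    Bit `pc` of the mask is 1 when the signal should be asserted at that PC.
    With pc_bits=6 the mask has 64 bits, the capacity of one 6-input LUT.
    """
    assert len(program) <= 1 << pc_bits, "program too long for this PC width"
    mask = 0
    for pc, opcode in enumerate(program):
        if predicate(opcode):
            mask |= 1 << pc
    return mask

def lut_read(mask, pc):
    """Evaluate the LUT: the PC bits select one bit of the mask."""
    return (mask >> pc) & 1

# Two assumed control signals derived purely from the program text.
IS_ADD = lut_mask(PROGRAM, lambda op: op == "add")
WRITE_ENABLE = lut_mask(PROGRAM, lambda op: op != "nop")

for pc in range(len(PROGRAM)):
    print(pc, PROGRAM[pc], lut_read(IS_ADD, pc), lut_read(WRITE_ENABLE, pc))
```

Because the masks are computed once from the program, nothing resembling an instruction fetch remains at run time; the PC alone selects every control value.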

7 Drastic Step - Delete Instruction Memory
[Block diagram: same as the typical microprocessor (PC with +1 incrementer, Instruction Memory, Decoder to control points, A registers, B registers, ALU, outside-world input and output), with the Instruction Memory crossed out (X)]

8 Lutiac
[Block diagram: PC with +1 incrementer feeding the Decoder directly, Decoder to control points; data path with A registers, B registers and ALU; input from the outside world, output to the outside world; no Instruction Memory]

9 Another Way to Think About It
At the point in a normal soft processor where the instruction is read from the instruction memory:
    instruction = instruction_memory[pc];
    if(instruction is this) do this;
    if(instruction is that) do that;
    ...
Replace by a case statement based on the pc:
    case(pc)
    0: do this;
    1: do that;
    2: do the other thing;
    ...
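
The equivalence between the two styles can be checked in software. The following is a minimal Python sketch, assuming a made-up three-instruction program and made-up opcode names (`load_const`, `add`); it models behaviour only, not hardware timing or the pipeline.

```python
# Hypothetical three-instruction program on registers r0..r2.
INSTRUCTION_MEMORY = [
    ("load_const", 0, 5),   # r0 = 5
    ("load_const", 1, 7),   # r1 = 7
    ("add", 2, 0, 1),       # r2 = r0 + r1
]

def run_with_instruction_memory(memory):
    """Conventional style: fetch from memory, then decode the opcode."""
    regs = [0, 0, 0]
    for pc in range(len(memory)):
        instruction = memory[pc]
        if instruction[0] == "load_const":
            regs[instruction[1]] = instruction[2]
        elif instruction[0] == "add":
            regs[instruction[1]] = regs[instruction[2]] + regs[instruction[3]]
    return regs

def run_hard_wired():
    """Lutiac style: no memory; the action is selected directly by the PC."""
    regs = [0, 0, 0]
    for pc in range(3):
        if pc == 0:
            regs[0] = 5
        elif pc == 1:
            regs[1] = 7
        elif pc == 2:
            regs[2] = regs[0] + regs[1]
    return regs

print(run_with_instruction_memory(INSTRUCTION_MEMORY))
print(run_hard_wired())
```

Both functions leave the registers in the same state; the second simply has the program folded into its control flow, which is the software analogue of replacing the instruction memory by a case statement on the PC.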

10 Lutiac Implementation
Built a very simple prototype 16-bit processor that uses hard-wired programs instead of an instruction memory
3-stage pipeline
  - decode: sets read addresses on register file
  - execute: computes results, sets up register file writes
  - write back: register file write
One cycle per instruction

11 Lutiac Implementation
No data memory, just registers
  - no fixed instruction format, so no hard limit on number of registers
One input port from outside world, one output port
Simple assembler converts my_program.s file into an equivalent Verilog processor description
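
The slides do not show the assembler's input syntax or the Verilog it emits, so the following Python sketch is only a guess at the general idea: each assembly line becomes one arm of a case statement keyed on the PC. The three-operand syntax, the `#` comment character, the supported opcodes and the shape of the generated text are all assumptions made for illustration, and the output is a fragment rather than a complete Verilog module.

```python
def assemble(source):
    """Translate a hypothetical three-operand assembly text into a
    Verilog-style case statement keyed on the program counter.

    Supported form (invented for this sketch): "<op> <dest>, <src1>, <src2>"
    with op in {add, sub, and, or}.
    """
    operators = {"add": "+", "sub": "-", "and": "&", "or": "|"}
    lines = ["case (pc)"]
    pc = 0
    for raw in source.splitlines():
        text = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not text:
            continue
        op, operands = text.split(None, 1)
        dest, src1, src2 = [name.strip() for name in operands.split(",")]
        lines.append(f"  {pc}: {dest} <= {src1} {operators[op]} {src2};")
        pc += 1
    lines.append("endcase")
    return "\n".join(lines)

EXAMPLE = """
add r2, r0, r1   # r2 = r0 + r1
and r3, r2, r0
"""
print(assemble(EXAMPLE))
```

Since register names are passed through as plain identifiers, nothing in this scheme caps the register count, which matches the slide's remark that there is no fixed instruction format.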

12 Experiments
Measure size and speed of Lutiac, varying:
  - number of different kinds of instructions in the program
  - size of the program
  - number of registers used
Used Quartus 8.0 (2 years ago now)
Stratix IV chips of various sizes, fastest speed grade
  - Each Stratix IV LAB contains 20 FFs + roughly 10 6-LUTs
  - Some LABs can be re-configured as 640-bit RAMs known as "MLABs"
Will compare to NIOS II at the end, but for now, remember that a medium-sized NIOS II uses 58 LABs and 11 M9K RAMs

13 Lutiac Size vs. Instruction Mix
Each program contains 64 random instructions, chosen from the allowed instruction types

14 Fmax vs. Instruction Mix

15 Effect of Program Size
Size grows linearly as program size increases beyond 64 instructions, roughly 1 LAB for every 20 additional instructions

16 Effect of Number of Registers
Very large Lutiac (512 random instructions) grows by the number of MLABs needed to hold additional registers
Would save area if we used M9Ks instead of MLABs once we needed more than bit registers

17 Scalability of Multiple Lutiac Cores
Chained N identical 64-instruction Lutiac cores together
  - LABs grow by 14.5 per core
  - Fmax drops as Quartus placement worsens
  - Ran out of DSP blocks above 256 cores

18 Comparison to NIOS II
Very inexact
  - NIOS II is 32 bits, Lutiac is 16 bits
  - NIOS II also has memory interfaces, caches, traps, ...
Configure NIOS II systems with 4K bytes of RAM
  - allows up to 1K words of instructions or data
Lutiac has no RAM, all instructions and data in MLABs
Lutiac and NIOS II both use four 18x18 multipliers (Multiplier/Accumulate mode)

19 Comparison to NIOS II

20 Comparison to NIOS II
Back-of-the-envelope guess (± factor of 2x)
Un-optimized 32-bit Lutiac (25 LABs) is nearly twice the size of a 16-bit Lutiac and runs at 0.75 the speed (177 MHz)
32-bit Lutiac/NIOS IIs speed ratio = (177 / 235)
Area ratio of Lutiac/NIOS IIs
  - (25 LABs + DSP) / (58 LABs + 11 M9K RAMs + DSP) = 0.3
32-bit Lutiac/NIOS IIs throughput/area
  - (177/235) / 0.3 = 2.5x
32-bit Lutiac/NIOS IIe throughput/area
  - NIOS IIe is smallest NIOS, but isn't pipelined, so has 5 cycles/instruction
  - (177/368 * 5/1) / ((25 LABs + DSP) / (37 LABs + 6 M9K RAMs)) = 4.5x
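
The arithmetic on this slide can be reproduced with a few lines of Python. All inputs are the figures quoted above; the variable names are my own. The slide does not state a numeric area ratio for the NIOS IIe comparison, so the last value is back-solved from the quoted 4.5x result and should be read as an implied figure, not a measured one.

```python
# Figures quoted on the slide.
LUTIAC32_MHZ = 177
NIOS_IIS_MHZ = 235
NIOS_IIE_MHZ = 368
NIOS_IIE_CYCLES_PER_INSTRUCTION = 5
AREA_RATIO_VS_IIS = 0.3            # quoted Lutiac / NIOS IIs area ratio
THROUGHPUT_PER_AREA_VS_IIE = 4.5   # quoted final result against NIOS IIe

# NIOS IIs: both processors are treated as one instruction per cycle,
# so the throughput ratio equals the clock ratio.
speed_ratio_iis = LUTIAC32_MHZ / NIOS_IIS_MHZ
throughput_per_area_iis = speed_ratio_iis / AREA_RATIO_VS_IIS

# NIOS IIe: Lutiac retires one instruction per cycle, NIOS IIe one every five.
throughput_ratio_iie = (LUTIAC32_MHZ / NIOS_IIE_MHZ) * NIOS_IIE_CYCLES_PER_INSTRUCTION

# Not stated on the slide: area ratio implied by the quoted 4.5x figure.
implied_area_ratio_iie = throughput_ratio_iie / THROUGHPUT_PER_AREA_VS_IIE

print(f"Lutiac/NIOS IIs speed ratio:      {speed_ratio_iis:.2f}")
print(f"Lutiac/NIOS IIs throughput/area:  {throughput_per_area_iis:.1f}x")
print(f"Lutiac/NIOS IIe throughput ratio: {throughput_ratio_iie:.2f}")
print(f"Implied Lutiac/NIOS IIe area ratio: {implied_area_ratio_iie:.2f}")
```

Given the stated uncertainty of a factor of two, only the rough magnitudes matter; the sketch mainly confirms that the quoted 2.5x follows from the quoted inputs.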

21 Lutiac Disadvantages
Limited to very small programs (200 instructions or so)
Must re-synthesize circuit every time program changes
  - instruction memory replaced by LUTs
  - would need good simulation tools
  - or a debug version of the processor that did have an instruction memory

22 Lutiac Advantages
Circuit is smaller, less complex than standard soft processor
One less stage in the pipeline
  - no instruction memory read required
Program contents are exposed to logic synthesis
  - data path components that aren't used will be removed by synthesis
  - circuit may be smaller and faster

23 Lutiac Advantages
Flexible and powerful
  - wide range of useful instructions can be available
  - if not used by program, they will be synthesized away
  - easy to add specialized instructions if needed
Not limited by a fixed instruction word width or encoding
  - can use as many registers as the program wants

24 Lutiac Advantages
Processor self-configures based on program
  - no "mega-wizard" needed
  - if multiplier/adder/etc. isn't used, synthesis will leave it out
Data path can adapt to the program
Examples:
  - if program ever references a register immediately after writing to it, create a bypass register; else leave bypass register out of circuit
  - if multiplier and adder were used in parallel, create a separate copy of the register file for the multiplier; else have it share the adder's register file

25 Conclusions
For small programs, it is possible to build 16-bit soft processors using only LABs (plus multiplier)
  - smaller and faster than smallest 32-bit NIOS II (37 LABs, 6 M9K RAMs)
  - with instructions/second on the same order as the mid-size NIOS II (58 LABs, 11 M9K RAMs)
  - size advantage over NIOS II disappears as program size approaches 1000 instructions