Processor Overview
Features:
- Designed for consumer and wireless products
- RISC processor with Harvard architecture
- Vector Floating Point coprocessor
- Branch prediction
- "TrustZone" security built into the CPU
- Instruction and data caches
- 8-stage pipeline
- 32-bit and 16-bit ("Thumb") instruction sets, and "Jazelle" technology for Java execution

Memory Hierarchy
- Harvard architecture: separate data and instruction caches
  - Allows simultaneous access
  - 64-bit datapaths
- L1 cache up to 64KB in size
  - 4-way set associative
  - Virtual index, physical tag
  - 8 words per line, critical word first on miss
  - Round-robin or pseudo-random replacement policy [1]
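The geometry above fixes how an address splits into set index and line offset. The sketch below works through that arithmetic for the assumed worst case (64KB capacity, 4 ways, 8 words of 4 bytes per line); the function names and the 32-bit word size are illustrative assumptions, not from the manual.

```python
# Address split implied by the L1 geometry above (assumed figures:
# 64 KB capacity, 4-way set associative, 8 x 4-byte words per line).

LINE_BYTES = 32                # 8 words x 4 bytes
WAYS       = 4
CAPACITY   = 64 * 1024

SETS        = CAPACITY // (WAYS * LINE_BYTES)   # 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 5 bits of line offset
INDEX_BITS  = SETS.bit_length() - 1             # 9 bits of set index

def split_address(vaddr):
    """Return (set index, line offset) taken from the virtual address.

    The cache is virtually indexed, so the set index comes from the
    virtual address; the tag is compared against the physical address
    after TLB translation (not modelled here).
    """
    offset = vaddr & (LINE_BYTES - 1)
    index  = (vaddr >> OFFSET_BITS) & (SETS - 1)
    return index, offset
```

Because index and offset together use only 14 bits, the virtual-index/physical-tag design lets the cache lookup begin in parallel with the TLB translation.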

Level 2 Interface
- A "high-bandwidth interface to second level caches, on-chip RAM, peripherals, and interfaces to external memory" [1]
- Level 2 interconnect has 64-bit wide interfaces for:
  - Instruction fetch
  - Data read/write
  - DMA
- Peripheral interface is 32 bits wide

Translation Lookaside Buffer (TLB)
- MicroTLBs
  - One each for instructions and data
  - 10 entries
  - Fully associative
  - Round-robin or random replacement
- Single main TLB
  - Contains a fully-associative region of 8 lockable elements
  - Misses handled by a two-level page table
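A small, fully associative TLB with round-robin replacement, as described above, can be sketched in a few lines. The 4KB page size and the translation values are assumptions for illustration; permission bits, the main TLB, and the page-table walk are not modelled.

```python
PAGE_SHIFT = 12  # assume 4 KB pages

class MicroTLB:
    """Toy 10-entry, fully associative TLB with round-robin replacement."""

    def __init__(self, entries=10):
        self.entries = entries
        self.slots = [None] * entries   # (vpn, pfn) pairs
        self.next_victim = 0            # round-robin replacement pointer

    def lookup(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        for slot in self.slots:         # "fully associative": check every slot
            if slot is not None and slot[0] == vpn:
                pfn = slot[1]
                return (pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
        return None  # miss: the main TLB / page-table walk would run here

    def fill(self, vaddr, paddr):
        # Victim choice ignores usage entirely: just rotate through slots.
        self.slots[self.next_victim] = (vaddr >> PAGE_SHIFT, paddr >> PAGE_SHIFT)
        self.next_victim = (self.next_victim + 1) % self.entries
```

With only 10 entries, round-robin is cheap to implement and close enough to random that the hardware offers either policy.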

Coprocessor Interface
- Core processor can interface to on-chip coprocessors
  - Instruction set supports up to 16 coprocessors; two of these are used by the VFP
- Coprocessors intended to run in step with the core and share data
  - Two-cycle delay gives "generous timing margins" [1]
  - Loose synchronization via token queues
  - Core may flush the coprocessor pipeline or cancel instructions
- Only one coprocessor "active" at a time
  - Not so bad: calls to driver software are core instructions
  - Allows much of the interface to be shared ($$$)
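The token-queue idea above decouples the two pipelines: the core posts work and keeps running, the coprocessor drains the queue at its own pace, and a flush simply empties it. A minimal sketch, assuming a single issue queue (the real ARM11 interface uses several queues with precise hand-off rules not modelled here):

```python
from collections import deque

class CoprocessorChannel:
    """Loose core/coprocessor coupling via a token queue (illustrative)."""

    def __init__(self):
        self.issue_q = deque()   # core -> coprocessor instruction tokens

    def core_issue(self, insn):
        # The core drops a token and keeps executing; it does not wait.
        self.issue_q.append(insn)

    def core_flush(self):
        # The core may flush pending coprocessor work (e.g. on a mispredict).
        self.issue_q.clear()

    def coprocessor_step(self):
        # The coprocessor consumes tokens at its own pace.
        return self.issue_q.popleft() if self.issue_q else None
```

Queues like this are what buy the "generous timing margins": neither side needs a same-cycle handshake.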

Coprocessor Synchronization [1]

VFP Coprocessor
- Uses a dedicated interface to the processor
- Implements the IEEE 754 Standard for Binary Floating-Point Arithmetic
- 64-bit load and store buses
- 3 independent, parallel pipelines:
  - Load and store
  - Multiply and accumulate
  - Divide and square root
- Short vector instructions: 8 single precision, 4 double precision
- No branch instructions
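A "short vector" instruction applies one scalar operation across a run of consecutive registers, up to 8 single-precision elements. The sketch below emulates that semantic on a plain register-bank list; the flat register numbering is a simplification (the real VFP divides its bank into circulating sub-banks, not modelled here).

```python
def short_vector_mul(bank, dst, src1, src2, length):
    """Emulate one short-vector multiply:
    bank[dst+i] = bank[src1+i] * bank[src2+i] for i in 0..length-1.

    A single instruction encoding drives the whole loop in hardware;
    length would be at most 8 for single precision, 4 for double.
    """
    for i in range(length):
        bank[dst + i] = bank[src1 + i] * bank[src2 + i]
```

One opcode thus keeps all three VFP pipelines busy for several cycles without refetching instructions.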

Branch Prediction
- Branch prediction (BP) can be turned on and off with a control register, providing a high level of control
- The ARM processor performs two types of BP:
  - Dynamic: performed in the Prefetch Unit
  - Static: performed by the integer core (and used the first time, before historical data exists)
- Branch folding: after prediction, the branch instruction is completely removed from the instruction stream presented to the pipeline

Dynamic Branch Prediction
- Dynamic branch prediction is the "first line" of branch prediction: if history exists, it will be used
- The Branch Target Address Cache (BTAC) holds virtual target addresses of previous branches
  - 128-entry, direct-mapped cache
  - Includes a 2-bit branch prediction history
- A BTAC hit produces a branch prediction with zero cycle delay
- Both resolved-taken and resolved-not-taken branches are stored in the BTAC, which improves performance
- Branch folding is done for almost all dynamically predicted branches
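The per-entry "2-bit branch prediction history" suggests the standard 2-bit saturating counter, in which a branch must mispredict twice before the predicted direction flips. A minimal sketch (the exact state encoding used by the hardware is an assumption):

```python
# Four states of a 2-bit saturating counter.
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

def predict(state):
    """True = predict taken (upper half of the state space)."""
    return state >= WEAK_T

def update(state, taken):
    """Saturate toward STRONG_T on taken, STRONG_NT on not taken."""
    if taken:
        return min(state + 1, STRONG_T)
    return max(state - 1, STRONG_NT)
```

The hysteresis is what makes loops cheap: the single not-taken exit of a loop only moves a STRONG_T entry to WEAK_T, so the next loop entry is still predicted taken.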

Static Branch Prediction
- Static branch prediction is based only on the branch instruction's characteristics (i.e., it does not use history)
- The rule is simple: all forward conditional branches are predicted not taken, and all backward branches are predicted taken
- "Around 65% of all branches are preceded by enough non-branch cycles to be completely predicted." [1]
- The static branch predictor is used on compulsory misses (i.e., the first time a branch is encountered) and when there are capacity or conflict misses in the BTAC
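The backward-taken/forward-not-taken rule reduces to a sign test on the branch displacement, since backward branches usually close loops. A one-line sketch (function name is illustrative):

```python
def static_predict_taken(pc, target):
    """Backward branch (target below pc) -> likely a loop -> predict taken;
    forward branch -> predict not taken."""
    return target < pc
```

In hardware this costs almost nothing: the prediction is just the sign bit of the branch offset field.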

TrustZone
- The ARM1176 processors implement "TrustZone" security extensions that "provide a secure environment for software" [1] [2]
- The hardware is partitioned so that resources are physically separated on the chip, creating a strong boundary between the Normal World and the Secure World
- Two virtual processors are created from the one physical processor, removing the need for a separate processor dedicated to security
- TrustZone-aware hardware, such as DMA controllers, allows secure data transfer
- Examples of how TrustZone can be used range from secure PIN entry from the keyboard to Digital Rights Management of multimedia data

Integer Pipeline
- Up to 4 instructions fetched
- Static branch prediction in Fe2
- Decode/Issue can hold a branch alongside another instruction
- Non-blocking loads
  - Hit Under Miss (HUM) buffer

Jazelle
- Java hardware acceleration: Java bytecode is translated to ARM instruction(s)
- Extra decode logic between the Fetch and Decode stages
- Extension of the ARM instruction set
  - Limited (unpublished) subset of Java bytecodes
  - Instructions to enter and exit Jazelle state
- Unsupported bytecodes are interpreted in software by the JVM
  - Requires a Jazelle-aware JVM
- Relatively proprietary: free/open-source JVMs cannot take advantage
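The decode stage can be pictured as a lookup: simple bytecodes map directly to ARM operations in hardware, and anything outside the supported subset traps to the JVM's software interpreter. The table entries below are illustrative only; ARM's actual bytecode subset and mappings are unpublished.

```python
# Hypothetical hardware-handled subset (iadd/isub are real JVM opcodes,
# but the ARM mappings shown are illustrative, not ARM's).
HANDLED = {
    0x60: "ADD r0, r0, r1",   # iadd
    0x64: "SUB r0, r0, r1",   # isub
}

def decode_bytecode(op):
    """Translate a bytecode in hardware, or trap to the JVM for the rest."""
    try:
        return HANDLED[op]
    except KeyError:
        return "trap-to-JVM"  # unsupported bytecode: software handles it
```

Because the trap path falls back to an ordinary interpreter, a Jazelle-aware JVM stays correct even when the hardware subset is small.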

Thumb
- 16-bit extension to the 32-bit ARM ISA: the "most commonly used" ARM instructions in 16-bit form
- Enables higher code density: "reduces memory bandwidth and size requirements by up to 35%" [4]
- Like Jazelle, requires extra pre-decode translation hardware
- Thumb-compiled code optimized for space can be linked against performance-critical code compiled to 32-bit ARM

References
[1] "ARM1176JZF-S Processor Technical Reference Manual", ARM Limited, ARM DDI 0301F.
[2] "TrustZone Hardware Architecture", ARM Limited, downloaded Dec. 4.
[3] "TrustZone System Design", ARM Limited, downloaded Dec. 4.
[4] "ARM1176JZ(F)-S", ARM Limited, downloaded Dec. 4.