Processor Overview
Features:
- Designed for consumer and wireless products
- RISC processor with Harvard architecture
- Vector Floating Point coprocessor
- Branch prediction
- “TrustZone” security built into the CPU
- Instruction and data caches
- 8-stage pipeline
- 32-bit and 16-bit (“Thumb”) instruction sets, and “Jazelle” technology for Java execution
Memory Hierarchy
- Harvard architecture: separate data and instruction caches
  - Allows simultaneous access
  - 64-bit datapaths
- L1 cache
  - Up to 64KB in size
  - 4-way set associative
  - Virtual index, physical tag
  - 8 words per line, critical word first on miss
  - Round-robin or pseudo-random replacement policy
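The cache geometry on this slide can be turned into an address split. The sketch below is a simplified model, not the processor's actual lookup logic: it assumes 4-byte words, 32-bit addresses, and the largest cache size mentioned (64KB), and it treats the tag as the physical address bits above the index field. The function name `split_address` is made up for this illustration.

```python
# Simplified model of a virtually indexed, physically tagged (VIPT) lookup.
WORD_BYTES = 4             # assumption: 32-bit words
WORDS_PER_LINE = 8         # from the slide
WAYS = 4                   # from the slide
CACHE_BYTES = 64 * 1024    # largest size mentioned on the slide

LINE_BYTES = WORD_BYTES * WORDS_PER_LINE        # 32 bytes per line
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1          # 9 bits select a set


def split_address(virtual_addr, physical_addr):
    """Return (tag, set_index, byte_offset) for one lookup.

    The set index comes from the virtual address, the tag from the
    physical address, which is what "virtual index, physical tag" means.
    """
    byte_offset = virtual_addr & (LINE_BYTES - 1)
    set_index = (virtual_addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = physical_addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, byte_offset
```

Under these assumptions each way covers 16KB, so the index can be read from the virtual address while the TLB translates the upper bits in parallel; the tag comparison then uses the translated physical address.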
Level 2 Interface
- “high-bandwidth interface to second level caches, on-chip RAM, peripherals, and interfaces to external memory”
- Level 2 interconnect: 64-bit wide interfaces
  - Instruction Fetch
  - Data Read/Write
  - DMA
- Peripheral Interface is 32 bits wide
Translation Lookaside Buffer (TLB)
- MicroTLBs
  - One each for instructions and data
  - 10 entries
  - Fully associative
  - Round-robin or random replacement
- Single Main TLB
  - Contains a fully associative region of 8 lockable elements
  - Misses handled by two-level page table
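A fully associative MicroTLB with round-robin replacement can be sketched as follows. This is an illustrative model only: the class name `MicroTLB`, its methods, and the use of bare page numbers are assumptions, and the real hardware compares all entries in parallel rather than in a loop.

```python
class MicroTLB:
    """Toy fully associative TLB with round-robin replacement."""

    def __init__(self, entries=10):          # 10 entries, as on the slide
        # Each slot holds (virtual_page, physical_page) or None when empty.
        self.slots = [None] * entries
        self.next_victim = 0                 # round-robin pointer

    def lookup(self, virtual_page):
        """Return the physical page on a hit, or None on a miss."""
        for slot in self.slots:
            if slot is not None and slot[0] == virtual_page:
                return slot[1]
        return None

    def insert(self, virtual_page, physical_page):
        """Fill the slot under the round-robin pointer, then advance it."""
        self.slots[self.next_victim] = (virtual_page, physical_page)
        self.next_victim = (self.next_victim + 1) % len(self.slots)
```

In this model a `lookup` miss would be followed by a Main TLB lookup and, failing that, a page table walk, with the result passed to `insert`.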
Coprocessor Interface
- Core processor can interface to on-chip coprocessors
- Instruction set supports up to 16 coprocessors
  - Two of these are used by the VFP
- Coprocessors intended to run in step with the core and share data
  - Two-cycle delay: “generous timing margins”
  - Loose synchronization via token queues
  - Core may flush the coprocessor pipeline or cancel instructions
- Only one coprocessor “active” at one time
  - Not so bad: calls to driver software = core instructions
  - Allows much of the interface to be shared ($$$)
Coprocessor Synchronization 
VFP Coprocessor
- Uses a dedicated interface to the processor
- IEEE 754 Standard for Binary Floating-Point Arithmetic
- 64-bit load and store buses
- 3 independent, parallel pipelines:
  - Load and store
  - Multiply and accumulate
  - Divide and square root
- Short vector instructions: up to 8 single-precision or 4 double-precision elements
- No branch instructions
Branch Prediction
- Branch prediction (BP) can be turned on and off with a control register
  - Provides a high level of control
- The ARM processor performs two types of BP
  - Dynamic: performed in the Prefetch Unit
  - Static: performed by the integer core (and the first time a branch is seen, before historical data exists)
- Branch folding
  - After prediction, the branch instruction is completely removed from the instruction stream presented to the pipeline
Dynamic Branch Prediction
- Dynamic branch prediction is the “first line” of branch prediction: if history exists, it will be used
- The Branch Target Address Cache (BTAC) holds virtual target addresses of previous branches
  - 128-entry, direct-mapped cache
  - Includes a 2-bit branch prediction history
- A BTAC hit produces a branch prediction with zero cycle delay
- Both kinds of branches (resolved taken and resolved not taken) are stored in the BTAC, which improves performance
- Branch folding is done for almost all dynamically predicted branches
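The BTAC behaviour described on this slide can be modelled as a direct-mapped table with a 2-bit saturating counter per entry. The sketch below rests on several assumptions that the slide does not state: the index is taken from the low bits of the word-aligned branch address, the full branch address serves as the tag, counter values 2 and 3 mean "predict taken", and new entries start in a weak state. The class and method names are hypothetical.

```python
class BTAC:
    """Toy 128-entry, direct-mapped branch target address cache."""

    ENTRIES = 128                              # from the slide

    def __init__(self):
        # Each entry is [branch_pc, target, counter] or None when empty.
        self.entries = [None] * self.ENTRIES

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low bits.
        return (pc >> 2) % self.ENTRIES

    def predict(self, pc):
        """Return (hit, predicted_taken, target)."""
        entry = self.entries[self._index(pc)]
        if entry is None or entry[0] != pc:
            return False, None, None           # miss: static prediction takes over
        return True, entry[2] >= 2, entry[1]

    def update(self, pc, target, taken):
        """Record the resolved outcome of a branch."""
        idx = self._index(pc)
        entry = self.entries[idx]
        if entry is None or entry[0] != pc:
            # Allocate, possibly evicting a different branch (conflict miss).
            self.entries[idx] = [pc, target, 2 if taken else 1]
            return
        entry[1] = target
        if taken:
            entry[2] = min(3, entry[2] + 1)    # saturate at strongly taken
        else:
            entry[2] = max(0, entry[2] - 1)    # saturate at strongly not taken
```

Because the table is direct mapped, two branches whose addresses differ by a multiple of 128 words compete for the same entry in this model, which is one source of the conflict misses that fall back to static prediction.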
Static Branch Prediction
- Static branch prediction is based only on the characteristics of the branch instruction (i.e., it does not use history)
- Simple: all forward conditional branches are predicted not taken, and all backward branches are predicted taken
- “Around 65% of all branches are preceded by enough non-branch cycles to be completely predicted.”
- The static branch predictor is used:
  - On compulsory misses (i.e., the first time a branch is encountered)
  - When there are capacity or conflict misses in the BTAC
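The static rule needs only the direction of the branch. A minimal sketch follows, in which the function name and the handling of unconditional branches are assumptions added for illustration:

```python
def static_predict_taken(branch_pc, target_pc, conditional=True):
    """Predict a branch from its direction alone, with no history.

    Backward branches (target below the branch address) are predicted
    taken, since they usually close loops. Forward conditional branches
    are predicted not taken.
    """
    if not conditional:
        # Assumption: an unconditional branch always goes to its target.
        return True
    return target_pc < branch_pc
```

For example, a loop-closing branch at 0x8010 that jumps back to 0x8000 is predicted taken, while a conditional skip from 0x8010 forward to 0x8020 is predicted not taken.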
TrustZone
- The ARM1176 processors implement “TrustZone” security extensions that “provide a secure environment for software”
- The hardware is partitioned so that resources are physically separated on the chip, creating a strong boundary between the Normal World and the Secure World
- Two virtual processors are created from the one physical processor, removing the need for a separate processor dedicated to security
- TrustZone-aware hardware such as DMA controllers allows secure data transfer
- Example uses of TrustZone range from secure PIN entry from the keyboard to Digital Rights Management of multimedia data
Integer Pipeline
- Up to 4 instructions fetched
- Static branch prediction in Fe2
- Decode/Issue can hold a branch alongside another instruction
- Non-blocking loads
- Hit Under Miss (HUM) buffer
Jazelle
- Java hardware acceleration
  - Java bytecode translated to ARM instruction(s)
  - Extra decode logic between Fetch and Decode stages
- Extension of ARM instruction set
  - Limited (unpublished) subset of Java bytecodes
  - Instructions to enter and exit Jazelle state
  - Unsupported bytecodes interpreted in software by the JVM
- Requires a Jazelle-aware JVM
  - Relatively proprietary
  - Free/Open Source JVMs cannot take advantage
Thumb
- 16-bit extension to the 32-bit ARM ISA
- “Most commonly used” ARM instructions in 16-bit form
- Enables higher code density
  - “Reduces memory bandwidth and size requirements by up to 35%”
- Like Jazelle, requires extra pre-decode translation hardware
- Can link Thumb-compiled code optimized for space against performance-critical code compiled to 32-bit ARM
References
① “ARM1176JZF-S Processor Technical Reference Manual”, ARM Limited, Lit.-Nr.: ARM DDI 0301F
② “TrustZone Hardware Architecture”, ARM Limited, re.html, downloaded Dec. 4
③ “Trust Zone System Design”, design.html, downloaded Dec. 4
④ “ARM1176JZ(F)-S”, ARM Limited, downloaded Dec. 4