Codesigned Virtual Machine -Transmeta CRUSOE-


1 Codesigned Virtual Machine -Transmeta CRUSOE-
Chang Hyun Lee System Design Group Seoul National University

2 Index
Introduction
Overview (and some details) of HW & SW
  HW part - VLIW architecture
  SW part - Code Morphing Software
Detailed description
  Speculation and Recovery
  Precise trap
  Memory mapped IO
  Data speculation
  Self modifying
Conclusion

3 Introduction
Who is Transmeta?
CTO David Ditzel, who had worked for Sun Microsystems, established Transmeta in 1995.
Nobody took an interest in Transmeta until it employed Linus Torvalds.
Only two sentences were displayed on its homepage: "Our first product's name is Crusoe" and "We have rethought the microprocessor".
Transmeta then introduced Crusoe to the market.
Design Goal of Crusoe
The ever-changing technology market continues to drive the need for ever more compact designs: processors that deliver high performance but are smaller in size, consume less power for longer battery life, and run cooler without the need for fans.
Lighter, Longer, Cooler

4 Introduction Characteristics of Crusoe
Transmeta’s Crusoe microprocessor is a full, system-level implementation of the x86 architecture, comprising a native VLIW microprocessor with a software layer, the Code Morphing Software (CMS).
128-bit VLIW (Very Long Instruction Word) engine: The Crusoe processor features a 128-bit wide VLIW engine that can issue up to 4 instructions per clock cycle.
Code Morphing Software: The CMS layer provides the Crusoe processor with x86 compatibility while giving the microprocessor added flexibility.
Integrated Architecture: To ease system design and enhance performance, Transmeta has integrated Northbridge functionality (SDR and DDR SDRAM memory controllers, a 32-bit, 33MHz PCI bus controller and a Serial ROM interface controller) directly into the Crusoe processor die.
LongRun Technology: LongRun technology allows the Crusoe processor to conserve power by dynamically adjusting its voltage and clock frequency.

6 Overview of HW part
What is VLIW (Very Long Instruction Word)?
A set of instructions to be executed in parallel, formatted as a single large instruction.

7 Overview of HW : Architecture
[Figure: Memory hierarchy of Crusoe (TM5800 microarchitecture). Labels: 2MB VMM; 512KB compressed VMM; 8KB local program memory; 14MB Crusoe data & translation; 8KB local data memory; 64KB I-cache; 64KB D-cache]

8 Overview of HW : register set
The processor has 64 GPRs, with the following specialized semantics:
* %r63 (%zero) always reads 0 when used as a source operand
* %r62 (%sink) is a discarded destination (e.g., for compares); it is never read
* %r59 (%from) holds the saved return address
* %r58 (%link) holds the return address
* %r47 (%sp) is the current stack pointer
* %r0 (%eax) holds current x86 machine state
* %r1 (%ecx) holds current x86 machine state
* %r2 (%edx) holds current x86 machine state
* %r3 (%ebx) holds current x86 machine state
Of these GPRs, 48 are backed by shadow GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the 'known good' shadow GPRs.
The processor also includes floating point registers and 16 FP shadow registers.
There is also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.

9 Overview of HW : Instruction set
Instruction Encoding
Instructions are encoded in little endian byte and word order, as shown in the following diagram:
All instructions (except branches) have a 9-bit opcode field. All opcodes share a common mapping into this 9-bit space.
The ALU0|imm32 and ALU0|ALU1 bundle types share the same format code (10), but the ALU1 slot is interpreted as an imm32 depending on the opcode:
  11xxxx011: 32-bit immediate in place of ALU1
  All others: execute ALU1 as an instruction
If the 11xxxx011 pattern appears in an ALU1 slot, an 8-bit immediate is used instead. It is not clear why this encoding is sometimes used instead of the normal 8-bit immediate form.
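The pattern test described above can be sketched as a simple mask-and-compare. This is only an illustration: it assumes the pattern 11xxxx011 is written most significant bit first, and the names IMM_PATTERN_MASK, IMM_PATTERN_BITS and matches_imm_pattern are hypothetical, not taken from any Transmeta documentation.

```python
# Hypothetical sketch: match a 9-bit opcode against the pattern 11xxxx011.
# Assumption: the pattern is written most significant bit first.
IMM_PATTERN_MASK = 0b110000111  # positions of the fixed (non-x) bits
IMM_PATTERN_BITS = 0b110000011  # required values at those positions


def matches_imm_pattern(opcode: int) -> bool:
    """Return True if the 9-bit opcode has the form 11xxxx011."""
    if not 0 <= opcode < (1 << 9):
        raise ValueError("opcode must fit in 9 bits")
    return (opcode & IMM_PATTERN_MASK) == IMM_PATTERN_BITS


print(matches_imm_pattern(0b110101011))  # fixed bits match the pattern
print(matches_imm_pattern(0b010101011))  # top bit differs
```

The four x bits are ignored by the mask, so 16 of the 512 possible opcodes match.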

10 Overview of HW : Instruction set
The figure below shows the instruction format for the two ALUs.
Two ALUs are provided in Crusoe. It appears that ALU1 executes a superset of the operations available on ALU0. The ALU1 slot is also used for all floating point and MMX operations, as indicated by ALU1's type select bits being something other than '00'.
All LSU operations take a fully calculated address in register ra; as with most VLIW architectures, no ra+offset or ra+rb addressing modes are provided.

11 Overview of HW : Instruction set
The figure below shows the instruction format for branches.
Branches (both conditional and unconditional) within CMS use a 23-bit absolute target address aligned to a 64-bit boundary (i.e., abstarget is shifted left 3 bits). Conditional branches use exactly the same condition code set (cc bits) as the x86 encoding of jump instructions. Unconditional branches can optionally write the return address to the %link register (%r58) if the L bit (bit 0 of the cc field) is set.
Indirect branches go through a general purpose register. It appears that special instructions are provided to prepare for an indirect branch when the target address is known in advance; this avoids the three-cycle branch penalty.
ex) The Lookup_Jump Rk instruction performs the jump to the TPC if there is a hit; otherwise it falls through.
It appears that the CMS address space is the only region from which code can be executed; the processor is physically incapable of executing code directly from user space. This makes sense considering that all x86 code must be translated (and thus copied to CMS space) before native execution.
With a 23-bit abstarget, 1 << (23+3) allows at most 64 MB to be dedicated to CMS; this is probably one reason why larger translation caches have not been used.
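The 64 MB figure follows directly from the field width and the alignment shift. The short sketch below redoes that arithmetic; the constant and function names are chosen here for illustration only.

```python
# Arithmetic behind the 64 MB limit on the CMS code region.
ABSTARGET_BITS = 23  # width of the absolute target field, per the text
ALIGN_SHIFT = 3      # targets are aligned to 64-bit (8-byte) boundaries


def branch_target_address(abstarget: int) -> int:
    """Convert an encoded 23-bit target field into a byte address."""
    if not 0 <= abstarget < (1 << ABSTARGET_BITS):
        raise ValueError("abstarget must fit in 23 bits")
    return abstarget << ALIGN_SHIFT


reachable_bytes = 1 << (ABSTARGET_BITS + ALIGN_SHIFT)
print(reachable_bytes // (1024 * 1024))  # 64 (MB)
```

The highest encodable target is 8 bytes below the 64 MB mark, since the low three address bits are always zero.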

12 Overview of HW : pipelining
The top row of the diagram indicates the pipeline for an ALU instruction, with the other rows representing the two other types of logical units. The pipeline is a fairly typical RISC design:
  Fetch0: The first 64 bits of a 64-bit or 128-bit bundle are fetched
  Fetch1: The second 64 bits are fetched (for 128-bit bundles only)
  Regs: Read source registers and decode/disperse instructions
  ALU: Execute single-cycle operations in ALU0 and ALU1
  Except: Complete two-cycle ALU0/ALU1 ops and detect exceptions
  Cache0: Initiate L1 data cache access based on register address
  Cache1: Complete L1 data cache access, TLB access and alias checks
  Write: Write results back to GPRs or store buffer
  Commit: Optionally latch the lower 48 GPRs into the shadow registers

13 Overview of SW part : CMS
The Code Morphing software is fundamentally a dynamic translation system, a program that compiles instructions for one instruction set architecture (in this case, the x86 target ISA) into instructions for another ISA (the VLIW host ISA). Transmeta’s Crusoe microprocessor is a full, system level implementation of the x86 architecture, comprising a native VLIW microprocessor with a software layer, the Code Morphing Software (CMS), that combines an interpreter, dynamic binary translator, optimizer, and runtime system.

14 Overview of SW : CMS
Requirements that CMS has to satisfy:
• CMS must faithfully implement the complete x86 architecture: all instructions (including memory-mapped I/O), architectural registers, and complete exception behavior.
• CMS can make no assumptions about the operating system running on the processor and cannot depend on information or other assistance from the system. It is a system-level implementation, not application-level, and even executes the BIOS code.
• CMS must provide robust performance for a wide variety of systems and applications. This requires dealing with unpleasant realities like self-modifying code and precise exceptions.

15 Overview of SW : CMS
Typical CMS control flow
CMS is structured like many other dynamic translation systems.
1. Initially, an interpreter decodes and executes x86 instructions sequentially, with careful attention to memory access ordering and precise reproduction of faults, while collecting data on execution frequency, branch directions, and memory-mapped I/O operations.
2. When the number of executions of a section of x86 code reaches a certain threshold, its address is passed to the translator. The translator selects a region and stores the translation, with various related information, in the translation cache.
3. From then on, until something invalidates the translation cache entry, CMS executes the translation whenever the x86 flow of control reaches the translated code region.
4. Once a branch target is identified as another translation, the branch operation is modified to go directly there, a process called chaining.
A variety of exceptional events may interrupt this typical control flow.
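The interpret-then-translate flow described above can be sketched as a toy dispatcher. Everything here is hypothetical: the class MiniCMS, the value of HOT_THRESHOLD (the source only says "a certain threshold") and the string stand-ins for translations are assumptions for illustration, and chaining is left out.

```python
from collections import Counter

HOT_THRESHOLD = 3  # assumed value; the source does not give the real one


class MiniCMS:
    """Toy model of the interpret / translate / execute control flow."""

    def __init__(self):
        self.exec_counts = Counter()   # x86 address -> interpreted executions
        self.translation_cache = {}    # x86 address -> translation (a label here)
        self.log = []                  # record of what happened, for inspection

    def run_block(self, addr):
        if addr in self.translation_cache:
            self.log.append(("execute", addr))  # run the cached translation
            return
        self.exec_counts[addr] += 1
        self.log.append(("interpret", addr))    # slow path, collects the profile
        if self.exec_counts[addr] >= HOT_THRESHOLD:
            self.translation_cache[addr] = f"translation@{addr:#x}"
            self.log.append(("translate", addr))

    def invalidate(self, addr):
        """Drop a translation, e.g. after the underlying x86 code changed."""
        self.translation_cache.pop(addr, None)
        self.exec_counts[addr] = 0


cms = MiniCMS()
for _ in range(5):
    cms.run_block(0x1000)
print([kind for kind, _ in cms.log])
```

With a threshold of 3, the block is interpreted three times, translated once, and then served from the translation cache.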

16 Overview of SW : CMS translation example
x86 instructions:
  A. addl %eax,(%esp)   // load data from stack, add to %eax
  B. addl %ebx,(%esp)   // ditto, for %ebx
  C. movl %esi,(%ebp)   // load %esi from memory
  D. subl %ecx,5        // subtract 5 from %ecx register
First pass (front end of the translator): a simple translation into atoms.
  ld %r30,[%esp]        // load from stack, into temporary
  add.c %eax,%eax,%r30  // add to %eax, set condition codes
  ld %r31,[%esp]
  add.c %ebx,%ebx,%r31
  ld %esi,[%ebp]
  sub.c %ecx,%ecx,5
Second pass (optimizer): applies well-known compiler optimizations such as common subexpression elimination, loop invariant removal and dead code elimination.
  ld %r30,[%esp]        // load from stack only once
  add %eax,%eax,%r30
  add %ebx,%ebx,%r30    // reuse data loaded earlier
  ld %esi,[%ebp]
  sub.c %ecx,%ecx,5     // only this last condition code needed
Final pass (scheduler): reorders atoms into molecules.
  1. ld %r30,[%esp]; sub.c %ecx,%ecx,5
  2. ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30
Note on common subexpression elimination: it is also commonly employed to reduce the number of operations in DSP algorithms after decomposing constant multiplications into shifts and additions.
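One of the optimizer steps in the example, dropping condition-code updates that nothing observes, can be sketched as a backward liveness pass. This is a simplified illustration and not the actual CMS optimizer: it assumes every ".c" atom overwrites all condition codes, that the flags must be correct at the end of the translation, and the tuple layout (mnemonic, operands, reads_flags) is invented here.

```python
def strip_dead_flag_updates(atoms):
    """Remove the '.c' suffix from atoms whose condition codes are never used.

    atoms: list of (mnemonic, operands, reads_flags) tuples in program order.
    Assumption: a mnemonic ending in '.c' overwrites all condition codes.
    """
    result = []
    flags_live = True  # x86 flags must be correct when the translation exits
    for mnemonic, operands, reads_flags in reversed(atoms):
        if mnemonic.endswith(".c"):
            if not flags_live:
                mnemonic = mnemonic[:-2]  # a later atom overwrites these flags
            flags_live = False            # flags before this atom are not needed
        if reads_flags:
            flags_live = True             # this atom consumes the earlier flags
        result.append((mnemonic, operands, reads_flags))
    result.reverse()
    return result


first_pass = [
    ("ld", "%r30,[%esp]", False),
    ("add.c", "%eax,%eax,%r30", False),
    ("ld", "%r31,[%esp]", False),
    ("add.c", "%ebx,%ebx,%r31", False),
    ("ld", "%esi,[%ebp]", False),
    ("sub.c", "%ecx,%ecx,5", False),
]
for mnemonic, operands, _ in strip_dead_flag_updates(first_pass):
    print(mnemonic, operands)
```

Only the final sub.c keeps its condition-code update, matching the second-pass listing. Removing the redundant second load is a separate optimization that this sketch does not attempt.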

17 Speculation & Recovery
Speculation: making and exploiting assumptions, unproven at translation time, about the code being translated.
ex) The translator might assume that two specific load and store instructions reference non-overlapping memory.
This type of speculation enables the generation of much more efficient translations, but should one or more assumptions prove to be false, incorrect results may be produced.
CMS uses a combination of hardware and software mechanisms to detect failing assumptions. The key hardware mechanisms are the commit stage and the shadow registers (discussed on the next slide).

18 Speculation & Recovery
Hardware Support for Speculation and Recovery
There are two copies of each register: a working copy and a shadow copy.
Normal atoms update only the working copy. When execution reaches the end of a translation, a commit operation copies all working registers into their corresponding shadow registers.
If any exception condition occurs inside a translation block, the runtime system undoes the effects of all molecules executed since the last commit; this is called a rollback.
Rollback: copies the shadow register values (committed at the end of the previous translation) back into the working registers.
Following a rollback, CMS usually interprets the x86 instructions corresponding to the faulting translation, executing them in the original program order.
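The working/shadow register pair can be modelled in a few lines. This is a software sketch of the idea only; the class name ShadowedRegisters and its methods are hypothetical, and the real mechanism is implemented in hardware.

```python
class ShadowedRegisters:
    """Toy model of working registers backed by shadow copies."""

    def __init__(self, names):
        self.working = {name: 0 for name in names}
        self.shadow = dict(self.working)  # last committed ("known good") state

    def write(self, name, value):
        self.working[name] = value        # normal atoms touch only the working copy

    def commit(self):
        self.shadow = dict(self.working)  # end of translation: latch the state

    def rollback(self):
        self.working = dict(self.shadow)  # exception: restore committed state


regs = ShadowedRegisters(["%eax", "%ecx"])
regs.write("%eax", 1)
regs.commit()             # first translation finished normally
regs.write("%eax", 99)    # speculative update inside the next translation
regs.rollback()           # a fault occurred before the next commit
print(regs.working["%eax"])  # 1
```

After the rollback, the interpreter can re-execute the x86 instructions from the committed state in original program order.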

19 Challenges due to Speculation
Challenges that CMS faces when applying speculation:
• CMS must faithfully reproduce the precise exception behavior of the x86 target, without overly constraining the scheduling of its translations.
• CMS must respond to interrupts at precise x86 instruction boundaries, where the system possesses a consistent target state.
• CMS must efficiently handle memory-mapped I/O and other system-level operations, without penalizing normal (non-I/O) memory references.
• Legacy PC software, especially games, often includes performance-critical self-modifying code. Similar problems result from pages containing both code and data, common in Windows/9X device drivers, BIOSs, and embedded systems running a real-time operating system.

20 Precise Exception & interrupt
In the x86 ISA, exceptions are precise: when one instruction causes an exception, all instructions preceding it must complete before the exception is reported, and none of the subsequent instructions may complete.
With hardware support for commit and rollback and the interpreter-based recovery procedure in place, CMS has much more flexibility in scheduling the translated instructions.
Commit and rollback serve a similar purpose with respect to interrupts.

21 Memory-mapped I/O
One of the most important rules associated with I/O transactions is that they must be performed in the original (x86) program order since they trigger irrevocable interactions with external devices. In the x86 architecture, devices can be accessed via two different mechanisms: explicit I/O instructions (“in/out”), and memory-mapped accesses. The former are easily recognized and translated appropriately. Memory-mapped I/O, however, cannot be distinguished at translation time from regular memory accesses. In addition, a given x86 instruction can access both regular memory and I/O space over the course of program execution.

22 Memory-mapped I/O
To solve the problem, load and store atoms on the Crusoe hardware specify whether they have been reordered with respect to the original x86 program. When such a speculative memory atom accesses a memory page that is mapped to I/O space, the hardware raises an exception. (To identify code regions that access volatile memory, an access protection bit can be added to the TLB.) At this point, CMS performs a rollback to the previously committed state and interprets the code. If the faults recur too often, CMS regenerates the translation, this time without reordering the offending memory reference.
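The fault-and-retranslate policy described above can be sketched as follows. All names and numbers are assumptions for illustration: IO_PAGES, PAGE_SHIFT and MAX_FAULTS are invented, and the source does not say how many faults count as "too often".

```python
PAGE_SHIFT = 12    # assumed 4 KB pages
IO_PAGES = {0xA0}  # hypothetical page numbers mapped to I/O space
MAX_FAULTS = 2     # assumed limit before the translation is regenerated


class SpeculativeIOFault(Exception):
    """Raised when a reordered memory atom touches an I/O page."""


def memory_atom(addr, reordered):
    """Model of a load/store atom carrying a 'reordered' marker."""
    if reordered and (addr >> PAGE_SHIFT) in IO_PAGES:
        raise SpeculativeIOFault(hex(addr))


class Translation:
    def __init__(self):
        self.reorder = True  # memory references were reordered by the scheduler
        self.faults = 0


def run(translation, addr):
    """Return which path executed the access: 'translated' or 'interpreted'."""
    try:
        memory_atom(addr, translation.reorder)
        return "translated"
    except SpeculativeIOFault:
        # Roll back to the committed state and interpret in program order.
        translation.faults += 1
        if translation.faults >= MAX_FAULTS:
            translation.reorder = False  # regenerate without reordering
        return "interpreted"


t = Translation()
print([run(t, 0xA0000) for _ in range(3)])
```

Accesses to ordinary memory never take the fault path, so normal (non-I/O) references are not penalized.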

23 Data Speculation
It is often desirable to be able to reorder load instructions ahead of store instructions. However, doing so is incorrect if the load happens to use data from the preceding store.
The Crusoe host provides innovative alias hardware that addresses this problem. When the translator moves a load operation ahead of a store operation, it converts the load into a load-and-protect (which, in addition to loading data, also records the address and size of the data loaded) and the store into a store-under-alias-mask (which checks for protected regions). In the (unlikely) event that the store operation overwrites the previously loaded data, the processor raises an exception and the runtime system can take corrective action.

24 Data Speculation example
X86 code:
  ld %r30,[%x]      // first load from location X
  ...
  st %data,[%y]     // might overwrite location X
  ld %r31,[%x]      // this accesses location X again
  use %r31
VLIW code:
  ldp %r30,[%x]     // load from X and protect it
  stam %data,[%y]   // this store traps if it writes X
  use %r30          // can use data from first load
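The ldp/stam pair can be modelled in software as a list of protected address ranges that every checked store is compared against. This is a simplified sketch: the class AliasUnit and its method names are hypothetical, and a real alias mask selects which protected regions a given store checks, whereas this model checks all of them.

```python
class AliasFault(Exception):
    """Raised when a checked store overlaps a protected load."""


class AliasUnit:
    """Toy model of load-and-protect / store-under-alias-mask."""

    def __init__(self):
        self.protected = []  # list of (address, size) recorded by ldp

    def load_and_protect(self, memory, addr, size):
        self.protected.append((addr, size))
        return bytes(memory[addr:addr + size])

    def store_under_alias_mask(self, memory, addr, data):
        end = addr + len(data)
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < end:  # ranges overlap
                raise AliasFault(f"store at {addr:#x} hits protected {p_addr:#x}")
        memory[addr:end] = data

    def clear(self):
        self.protected.clear()  # e.g. at a commit point


memory = bytearray(16)
unit = AliasUnit()
value = unit.load_and_protect(memory, 4, 4)           # ldp %r30,[%x]
unit.store_under_alias_mask(memory, 8, b"\x01\x02")   # stam: no overlap, succeeds
try:
    unit.store_under_alias_mask(memory, 6, b"\xff\xff")  # overlaps bytes 6..7
except AliasFault as fault:
    print("rollback needed:", fault)
```

When the fault fires, the store has not been performed, so the runtime can roll back and re-execute in the original order.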

25 Self Modifying Code
At times, x86 instructions in memory get overwritten, either because the operating system is loading a new program, or because an application is using self-modifying code. When this happens to code that has already been translated, the Code Morphing software needs to be notified to keep it from erroneously executing a translation for the old code.
To this end, whenever the system translates a block of x86 code, it write-protects the page of x86 memory containing that code. It does so by setting a dedicated “translated” bit in that page’s entry in the processor’s memory management unit. (As with other details of the VLIW hardware, that bit is invisible to x86 software.)
When a protected page is written to, the simplest remedy is to invalidate the affected translation(s). As the runtime system dynamically learns more about the program’s behavior, it switches to more sophisticated strategies (Self-Revalidating & Self-Checking).
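The simplest remedy, page-level protection with invalidation on write, can be sketched like this. The class and method names are hypothetical, the 4 KB page size is an assumption, and the presence of a page in the dictionary stands in for the hardware "translated" bit.

```python
PAGE_SIZE = 4096  # assumed page size


class ProtectedTranslationCache:
    """Toy model: translations indexed by the x86 page they were made from."""

    def __init__(self):
        self.by_page = {}  # page number -> set of translated x86 addresses

    def add_translation(self, x86_addr):
        # Translating code on a page sets that page's "translated" bit.
        self.by_page.setdefault(x86_addr // PAGE_SIZE, set()).add(x86_addr)

    def is_protected(self, page):
        return page in self.by_page

    def on_write(self, addr):
        """Handle an x86 store; return the translations that were invalidated."""
        page = addr // PAGE_SIZE
        # Writing to a protected page invalidates everything translated from it
        # and clears the protection until the code is translated again.
        return self.by_page.pop(page, set())


cache = ProtectedTranslationCache()
cache.add_translation(0x1000)
cache.add_translation(0x1010)
cache.add_translation(0x3000)
print(sorted(hex(a) for a in cache.on_write(0x1FF0)))
```

A single store anywhere on the page discards every translation made from it, which is why the finer-grained strategies on the following slides matter for pages that mix code and data.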

26 Self Modifying Code : Fine-Grain Protection
The Crusoe processor provides hardware support for write-protecting memory at a granularity finer than full pages.
SW: manages the TLB
HW: a write-protect table that holds the fine-grained write protection information used for checking
SMC optimizations: Fine-Grain Protection, Self-Revalidating, Self-Checking

27 Self Modifying Code : Self Revalidating Translation
If CMS determines that a translation is encountering legitimate protection faults due to data stores in the same region as code, it can make the faults less expensive by adding a prologue, which is a code segment that is invoked just before a translation is entered. Prologues are generally used for temporary monitoring purposes, and allow easy installation and removal without disturbing the translation. Inserting a prologue involves removing any existing chains to the translation, and replacing its start address by the prologue address.
Once a candidate translation for self-revalidation is identified, it is flagged. The next time it is encountered, it is re-translated in order to capture the translated x86 code.
Later, if the handler for a fine-grain protection fault determines that the translation(s) might be affected, it enables the prologue and turns off protection to avoid faulting again.
When the translation is next invoked, the prologue verifies that the x86 code corresponding to the translation has not changed, re-enables protection, re-verifies the x86 code, disables the prologue, and then executes the translation.
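The sequence of steps performed by the fault handler and the prologue can be sketched as a small state machine. This is an illustrative model only: the class name, the attribute names and the "invalidate" outcome for changed code are assumptions, since the text does not say what happens when verification fails.

```python
class SelfRevalidatingTranslation:
    """Toy model of a translation with a self-revalidating prologue."""

    def __init__(self, memory, start, length):
        self.memory = memory
        self.start = start
        self.length = length
        # x86 bytes captured when the translation was (re)built.
        self.snapshot = bytes(memory[start:start + length])
        self.protected = True
        self.prologue_enabled = False

    def _source_unchanged(self):
        current = bytes(self.memory[self.start:self.start + self.length])
        return current == self.snapshot

    def on_protection_fault(self):
        """Fault handler: the translation might be affected by a store."""
        self.prologue_enabled = True
        self.protected = False  # avoid faulting again on further stores

    def enter(self):
        """Called when control reaches the translation."""
        if self.prologue_enabled:
            if not self._source_unchanged():
                return "invalidate"  # assumed outcome when the code changed
            self.protected = True    # re-enable protection
            if not self._source_unchanged():  # re-verify after protecting
                return "invalidate"
            self.prologue_enabled = False
        return "execute"


memory = bytearray(b"\x90" * 8 + b"\x00" * 8)  # 8 code bytes, then 8 data bytes
translation = SelfRevalidatingTranslation(memory, 0, 8)
translation.on_protection_fault()
memory[12] = 7               # a data store next to the code
print(translation.enter())   # code bytes unchanged
```

In this single-threaded model the second verification is redundant; it is kept only to mirror the order of steps given in the text.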

28 Self Modifying Code : Self Revalidating Translation
Self-revalidation can be quite efficient if the writes are much less frequent than executions of the affected translations.
However, this technique does not work if it is the translation itself that writes to its associated x86 region, since the write occurs after the checking prologue has completed, causing a new fault and preventing forward progress.
For such cases, the following technique for optimizing fault detection, self-checking, may work better.

29 Self Modifying Code : Self Checking Translation
Instead of protecting the x86 page when creating a translation, it is possible to leave the memory page unprotected, and have the translation itself check that the source x86 bytes have not changed, by fetching them and comparing them to their values when the translation was created.
The checking code can be merged into the normal translation code.
The overhead of self-checking a translation once is many times smaller than that of self-revalidating it once, although its average cost may be much higher if the translation is executed many times between protection faults.
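A self-checking translation can be sketched as a closure that compares the current x86 bytes with the bytes seen at translation time before running its body. The names make_self_checking and StaleTranslation are hypothetical, and the body is an arbitrary Python callable standing in for the translated code.

```python
class StaleTranslation(Exception):
    """Raised when the source x86 bytes no longer match the translation."""


def make_self_checking(memory, start, length, body):
    """Wrap a translation body with a check of its source x86 bytes."""
    expected = bytes(memory[start:start + length])  # bytes at translation time

    def translation():
        # The check runs on every entry; the page itself stays unprotected.
        if bytes(memory[start:start + length]) != expected:
            raise StaleTranslation(f"x86 code at {start:#x} was modified")
        return body()

    return translation


memory = bytearray(b"\x90\x90\x90\x90")
translated = make_self_checking(memory, 0, 4, lambda: "ran translation")
print(translated())
memory[1] = 0xCC  # the program overwrites its own code
try:
    translated()
except StaleTranslation as stale:
    print("retranslate:", stale)
```

The comparison cost is paid on every execution, which is the trade-off against self-revalidation noted above.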

30 Conclusion
Crusoe broke new ground in using a codesigned VM to achieve power efficiency and design simplicity.
Crusoe achieves:
  Low power
  Mobility
  Compatibility with x86
However, its performance in real applications was not as good as advertised.

