4/14/20151 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam King,

Slides:

Advertisements

Similar presentations

More Intel machine language and one more look at other architectures.

Advertisements

Fetch Execute Cycle – In Detail -

Instruction Set Design

Process management Information maintained by OS for process management  process context  process control block OS virtualization of CPU for each process.

Adding the Jump Instruction

ISA Issues; Performance Considerations. Testing / System Verilog: ECE385.

Machine/Assembler Language Putting It All Together Noah Mendelsohn Tufts University Web:

Chapter 2 Machine Language.

ELEN 468 Advanced Logic Design

Fall EE 333 Lillevik 333f06-l4 University of Portland School of Engineering Computer Organization Lecture 4 Assembly language programming ALU and.

1 Today’s lecture  Last lecture we started talking about control flow in MIPS (branches)  Finish up control-flow (branches) in MIPS —if/then —loops —case/switch.

1 RISC Pipeline Han Wang CS3410, Spring 2010 Computer Science Cornell University See: P&H Chapter 4.6.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University RISC Pipeline See: P&H Chapter 4.6.

Stored Program Concept: The Hardware View

1 Sec (2.3) Program Execution. 2 In the CPU we have CU and ALU, in CU there are two special purpose registers: 1. Instruction Register 2. Program Counter.

Princess Sumaya Univ. Computer Engineering Dept. Chapter 4: IT Students.

Princess Sumaya Univ. Computer Engineering Dept. Chapter 4:

Computer Science 210 Computer Organization The Instruction Execution Cycle.

12/5/20151 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam King,

COMPILERS CLASS 22/7,23/7. Introduction Compiler: A Compiler is a program that can read a program in one language (Source) and translate it into an equivalent.

AMD64/EM64T – Dyninst & ParadynMarch 17, 2005 The AMD64/EM64T Port of Dyninst and Paradyn Greg Quinn Ray Chen

12/18/20151 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam.

Reminder Bomb lab is due tomorrow! Attack lab is released tomorrow!!

1/31/20161 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam King,

CS61C L20 Datapath © UC Regents 1 Microprocessor James Tan Adapted from D. Patterson’s CS61C Copyright 2000.

COMP 1321 Digital Infrastructure Richard Henson University of Worcester October 2012.

Lecture 6: Branching CS 2011 Fall 2014, Dr. Rozier.

Lecture 2: Instruction Set Architecture part 1 (Introduction) Mehran Rezaei.

Digital Computer Concept and Practice Copyright ©2012 by Jaejin Lee Control Unit.

Pipeline Processor Design Project Jarred Beck. Design Assumptions Three bit opcode This is to be able to address all of the 8k memory directly =

3/17/20161 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam King,

Information Security - 2. CISC Vs RISC X86 is CISC while ARM is RISC CISC is Compiler’s heaven while RISC is Architecture’s heaven Orthogonal ISA in RISC.

E Virtual Machines Lecture 2 CPU Virtualization Scott Devine VMware, Inc.

Instruction Set Architectures Continued. Expanding Opcodes & Instructions.

Spring 2016Assembly Review Roadmap 1 car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c); Car c = new Car(); c.setMiles(100);

Machine-Level Programming I: Basics

Displacement (Indexed) Stack

Chapter 5: A Multi-Cycle CPU.

CS161 – Design and Architecture of Computer Systems

Credits and Disclaimers

COMP 1321 Digital Infrastructure

Assembly Programming II CSE 351 Spring 2017

Vijay Janapa Reddi The University of Texas at Austin Interpretation 2

Chapter 11 Instruction Sets

Edexcel GCSE Computer Science Topic 15 - The Processor (CPU)

Operating Systems Design (CS 423)

Choices in Designing an ISA

ELEN 468 Advanced Logic Design

Prof. Sirer CS 316 Cornell University

Recitation: Attack Lab

Computer Science 210 Computer Organization

CS/COE0447 Computer Organization & Assembly Language

The fetch-execute cycle

Computer Science 210 Computer Organization

Single-Cycle CPU DataPath.

CSE 351 Section 10 The END…Almost 3/7/12

x86-64 Programming I CSE 351 Autumn 2017

Instruction Set Architectures Continued

Guest Lecturer TA: Shreyas Chand

x86-64 Programming I CSE 351 Winter 2018

x86-64 Programming I CSE 351 Winter 2018

COMP 1321 Digital Infrastructure

Get To Know Your Compiler

A Level Computer Science Topic 5: Computer Architecture and Assembly

Processor Design CS 161: Lecture 1 1/28/19.

Credits and Disclaimers

Credits and Disclaimers

Caches & Memory.

Low level Programming.

Sec (2.3) Program Execution.

Presentation transcript:

4/14/20151 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam King, and Andrew S Tanenbaum

Dynamic binary translation Simulate basic block using host instructions. Basic block? Cache translations 4/14/20152 Load r3, 16(r1) Add r4, r3, r2 Jump 0x48074 Load r3, 16(r1) Add r4, r3, r2 Jump 0x48074 Load t1, simRegs[1] Load t2, 16(t1) Store t2, simRegs[3] Load t1, simRegs[1] Load t2, 16(t1) Store t2, simRegs[3] Load t1, simRegs[2] Load t2, simRegs[3] Add t3, t1, t2 Store t3, simregs[4] Load t1, simRegs[2] Load t2, simRegs[3] Add t3, t1, t2 Store t3, simregs[4] Store 0x48074, simPc J dispatch_loop Binary Code Translated Code

Performance Why is this faster? Are there disadvantages? 4/14/20153 while(1){ inst = mem[PC]; // fetch //decode if(inst == add){ //execute... } } // repeat while(1){ if(!translated(pc)){ translate(pc); } jump to pc2tc(pc); } // repeat

Performance Why is this faster? Caching the translation Are there disadvantages? Harder to code, slower if only execute once 4/14/20154 while(1){ inst = mem[PC]; // fetch //decode if(inst == add){ //execute... } } // repeat while(1){ if(!translated(pc)){ translate(pc); } jump to pc2tc(pc); } // repeat

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/20155 Physical Registers cpu->regs Why don’t we use one-to-mapping from guest regs to host regs?

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/20156 Physical Registers cpu->regs Guest: Load r3, 16(r1) Load r1, imm(r0)

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/20157 Physical Registers cpu->regs gen_load_T0(guest_r1) Load r0, gr1_off(r5) Guest: Load r3, 16(r1)

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/20158 Physical Registers cpu->regs gen_load_T1_T0(16) Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0)

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/20159 Physical Registers cpu->regs gen_store_T1_(guest_r3) Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0) Store r1, gr3_off(r5)

Dynamic Binary Translation r0 (T0) r1 (T1) r2 (T2) r3 r4 r5 4/14/ Physical Registers cpu->regs Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0) Guest mem: 0x1190 -> 0x1234 Store r1, gr3_off(r5) r0 (T0) r1 (T1)0x1180 r2 (T2) r3 r4 r5 Sim Registers

Dynamic Binary Translation r0 (T0)0x1180 r1 (T1) r2 (T2) r3 r4 r5 4/14/ Physical Registers cpu->regs Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0) Guest mem: 0x1190 -> 0x1234 Store r1, gr3_off(r5) r0 (T0) r1 (T1)0x1180 r2 (T2) r3 r4 r5 Sim Registers

Dynamic Binary Translation r0 (T0)0x1180 r1 (T1)0x1234 r2 (T2) r3 r4 r5 4/14/ Physical Registers cpu->regs Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0) Guest mem: 0x1190 -> 0x1234 Store r1, gr3_off(r5) r0 (T0) r1 (T1)0x1180 r2 (T2) r3 r4 r5 Sim Registers

Dynamic Binary Translation r0 (T0)0x1180 r1 (T1)0x1234 r2 (T2) r3 r4 r5 4/14/ Physical Registers cpu->regs Load r0, gr1_off(r5) Guest: Load r3, 16(r1) Load r1, 16(r0) Guest mem: 0x1190 -> 0x1234 Store r1, gr3_off(r5) r0 (T0) r1 (T1)0x1180 r2 (T2) r30x1234 r4 r5 Sim Registers

Exercise Generate the translation code for bitwise OR of guest code E.g., Or r4, r2, r1 -> r4 = r2 | r1 Write any prototypes for code gen E.g., gen_load_T1_T0(imm) Write the translation code Note: feel free to reuse any codegen from example 4/14/201514

Useful functions Instruction: or r4, r2, r1 r4 = r2 | r1 Gen_load_T0(guest reg no) Gen_load_T1(guest reg no) Gen_store_T2(guest reg no) 4/14/201515

Solution 4/14/201516

Solution or r4, r2, r1 Gen_load_T0(gr2_off) Gen_load_T1(gr1_off) Gen_or_T2_T0_T1() Or r2, r0, r1 Gen_store_T2(gr4_off) 4/14/201517

Performace 4/14/201518

Dynamic binary translation for x86 Goal: translate this basic block Note: x86 assembly dst value is last Mov %rsp, %rbp Sub $0x10, %rsp Movq $0x100000, 0xfffffffffffffff8(%rbp) Problem: x86 instructions have varying length 4/14/201519

Inst-by-inst simulation: fetch void fetch(struct CPU *cpu, struct control_signals *control){ memset(control, 0, sizeof(struct control_signals)); control->orig_rip = cpu->rip; control->opcode[control->numOpcode++] = fetch_byte(cpu); if(IS_REX_PREFIX(control->opcode[0])) { control->rex_prefix = control->opcode[control->numOpcode-1]; control->rex_W = REX_PREFIX_W(control->rex_prefix); control->opcode[control->numOpcode-1] = fetch_byte(cpu); } //and more of this for multi-byte opcodes 4/14/201520

Inst-by-inst simulation: decode if(HAS_MODRM(control->opcode[0]) { control->modRM = fetch_byte(cpu); control->mod = MODRM_TO_MOD(control->modRM); control->rm = MODRM_TO_RM(control->modRM); control->reg = MODRM_TO_R(control->modRM); // now calculate m addresses if(IS_SIB(control)){ // more address lookup and fetching } else if(IS_DISP32) { // fetch immediate } else { addr = getRegValue(cpu, control) } // now fetch any disp8 or disp32 for addr 4/14/201521

Dynamic translation Amortize the cost of fetch and decode Two different contexts Translator context – fetch and decode Guest context – execute 4/14/201522

Dynamic translation for x86 Map guest registers into these host registers T0 -> rax T1 -> rbx T2 -> rcx – use for host addresses Rdi -> holds a pointer to cpu 4/14/201523

Inst-by-inst: Mov %rsp, %rbp cpu->reg[control->rm] = cpu->reg[control->reg]; where rm == rbp index and reg == rsp index 4/14/201524

DBT: Mov %rsp, %rbp void mov(struct CPU *cpu, struct control_signals *control) { gen_load_T0(cpu->tcb, getRegOffset(cpu, control->reg)); gen_store_T0(cpu->tcb, getRegOffset(cpu, control->rm)); } 4/14/201525

DBT: Mov %rsp, %rbp void gen_load_T0(struct tcb *tcb, uint32_t offset) { uint64_t opcode = offset; // create movq offset(%rdi), %rax opcode <<= 24; opcode |= 0x878b48; // store in cache memcpy(tcb->buffer+tcb->bytesInCache, translation, size); tcb->bytesInCache += size; 4/14/201526

DBT: compile // Mov %rsp, %rbp gen_load_T0(control->reg) gen_store_T0(control->rm) //Sub $0x10, %rsp gen_load_rm32_T0(cpu, control) gen_alu_imm32_T0(tcb, imm, SUB_ALU_OP) gen_store_rm32_T0(cpu, control) // movq $0x100000, -8(%rbp) gen_mov_imm32_T0(cpu, control->imm) gen_store_rm64_T0(cpu, control) 4/14/201527

Switching to guest context Jump_to_tc(cpu) { //will return a pointer to beginning of TC buf void (*foo)(struct CPU *) = lookup_tc(cpu->rip) foo(cpu); } What does the preamble for your TCB look like? How do we return back to the host (or translator) context? 4/14/201528

DBT optimizations Mov %rsp, %rbp Sub $0x10, %rsp movq $0x100000, -8(%rbp) Are there opportunities for optimizations? 4/14/201529

Discussion questions Could you run this raw code on the CPU directly? What assumptions do you have to make? When do you have to invalidate your translation cache Break into groups to discuss this Read embra for discussion about using guest physical vs guest virtual pc values to index tb 4/14/201530