Presentation is loading. Please wait.

Presentation is loading. Please wait.

Fast Functional Simulation with a Dynamic Language Craig S. Steele, Exogi LLC, USA JP Bonn, Exogi LLC, USA.

Similar presentations


Presentation on theme: "Fast Functional Simulation with a Dynamic Language Craig S. Steele, Exogi LLC, USA JP Bonn, Exogi LLC, USA."— Presentation transcript:

1 Fast Functional Simulation with a Dynamic Language Craig S. Steele, Exogi LLC, USA JP Bonn, Exogi LLC, USA

2 2 HPEC 2012 Fast Functional Simulation with a Dynamic Language Craig S. Steele and JP Bonn Exogi LLC Las Vegas, NV, USA

3 3 Presentation Outline Motivation for Fast Functional Simulator –Programmable Components Static versus Dynamic Simulators –Dynamic Binary Translation Lua Scripting Language Dynamic Language Tracing JIT compilers Instruction Pipeline in Software Simulating Heterogeneous Devices

4 4 Motivation for Fast Functional Sim Larger systems (including SoCs, FPGAs) –Many programmable components –Non-programmable have functional spec Lower-level simulations slow, getting slower Waterfall (or over-the-wall) HW/SW flow –Software is late from the start, unintegrated –Fast functional sim enables SW developers Future resilient systems need autonomic SW

5 5 Static vs. Dynamic ISS Fetch-decode-execute iterative interpreter –10+ MIPS –SW analog of a non-pipelined CPU Pre-decode binary instructions Convert each instruction to “C” function call Compile C function call to native code Do more of the above at runtime Support JIT and self-modifying code

6 6 Dynamic Binary Translation Dynamic Binary Translator (DBT) does it all QEMU (Quick EMUlator) Fabrice Ballard Hundreds rather than tens of simulated MIPS –Basic blocks (BBs), not single instructions Can be used for whole-system simulation –Mainly PC-oriented virtual machines –Also some embedded non-PC systems –Support for several instruction sets

7 7 Code Generation DSL Model target instruction set with “micro-ops” Tiny Code Generator (TCG) –Second Generation DSL Another app-specific language to remember –Each instruction set is a different port –Supportive user community is tiny Simulator code is very different from TCG –New device may need new TCG micro-ops –Code differs for prog & non-prog devices

8 8 Static vs. Dynamic Languages Static Languages: C / C++ –Variables have a type –Compiler detects type mismatches –Compiled separately from execution Optimize lots of possible paths Dynamic Languages –Data is typed, variable can hold any datum –Type is detected at runtime –“Scripting” languages are often interpreted

9 9 Lua Language Small, multi-paradigm “scripting” language Supports a functional programming style –First-class functions & closures Small conceptual and resource footprint –Well under 300KB memory footprint Good embeddability and portability Good interface to C

10 10 Lua Language Quirks Only composite data type is table –Associative array –Supports both integer and hashed keys Limited native data types “Number” type: double or dual double/integer –Dynamic type inference/specialization No standardized class-based object system –Many object systems, may not intermix Tables & closures reduce centrality of classes

11 11 LuaJIT Compiler Tracing Just-in-Time compilers –Excellent for dynamic languages –Data types can be observed at runtime –Optimization and specialization of types Source compiled to register-based bytecode Fast assembly-coded interpreter for tracing “Hot code” chunks optimized to native binary Excellent C Foreign-Function Interface (FFI)

12 12 C One-Instr. Fetch-Execute Loop for (int i=0; i < nIters; i++) { #if TRACE printf("%08d| PC = 0x%08X\n", i, node->curPC); if (node->curPC == pausePC) pauseCount++; #endif uint32_t deltaPC = sfp((void *)di, (void *)node); node->curPC = node->nextPC; di = virtual_icache_entry(node->nextPC,node); sfp = di->fPtr; node->nextPC += deltaPC; }

13 13 Lua One-BB Fetch-Execute Loop for i = 1, n_BBs do -- 140 MOps vs 572 MOps with single-entry cache if cur_pc ~= next_pc then -- avoid table lookup if no change cur_pc = next_pc cur_instr = loc_imem[cur_pc] end next_pc = cur_instr(cpu) end

14 14 Closures as Object Substitutes 1 -- Generated generic code for all “xor” instructions -- Anonymous function is a “factory” for XOR/XORI instr. instances -- When factory function is called, function arguments are persistently bound to values return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,store_RX,valA,valB) local result = op_fcns.lm32_alu_fcn_xor(valA,valB) return store_RX(seq_instr,seq_pc,br_pc,cpu,result) end

15 15 Closures as Object Substitutes 2 -- Generated generic code for all “bge” branch instructions -- Anonymous function is a “factory” for BGE/BGEU instr. instances -- When factory function is called, function arguments are persistently bound to values return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB) local result = op_fcns.lm32_cond_fcn_bge(valA,valB) local next_pc = select_pc(seq_pc,br_pc,result) return next_pc -- Dynamic return types: (integer | function ref) end

16 16 Tail-Call Optimization Where does “TCOSim” name come from? TC: if last statement in a function is a fcn call A tail call is a GOTO for functional languages Tail-call Optimization (TCO) –Recursion without a stack Multiple functions can be optimized as one LuaJIT tracing compiler optimizes calls away Functions are organizational, not executable

17 17 Pipelined Instruction Definition Example: 3-operand RISC ALU instruction class local function lm32_build_RR_format_instr(format,instr) … local load_valB = lm32_load_RZ[instr[4]] … return load_valB(seq_instr, seq_pc,cpu,lm32_ops,store_val,alu_op,load_valA) …return load_valA (seq_instr, … … return alu_op(seq_instr, … … return seq_instr(cpu) Example: Helper ALU operation shared by XOR/XORI local function lm32_alu_fcn_xor(x,y) local result = bit_bxor(x + y) return result end

18 18 Basic Block Instruction Fusion Non-branching instructions end with tail call return seq_instr(cpu) Branch instructions return new PC lm32_operation_builder_string["BI"] = [[ return function(seq_instr, seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB) local result = op_fcns.lm32_cond_fcn_%s(valA,valB) local next_pc = select_pc(seq_pc,br_pc,result) return next_pc end ]]

19 19 Basic C vs. TCOSim ISS Speed C-based Single-Instruction Sim: Tens of MIPS

20 20 Simulating Heterogeneous Devices Simulator calls a chain of instructions –Basic block, ending with branch –Device/CPU is a parameter CPU or other programmed HW With HDL design, most HW is a program Devices can be intermixed in BB chain –Multiple CPU HW threads (or hetero CPUs) –DMA devices –Event polling devices, e.g., interrupt control

21 21 Future Directions Integrate gdb debug server –Compile existing C server code for target –Run gdb server as “virtual” thread on sim –Should be fairly portable to real HW Support external events, e.g., interrupts ① Interleave polling instruction blocks ② Asynchronous callback to inner loop Multi-threaded simulation host support Self-host LuaJIT compiler within sim target

22 22 Conclusion Lua: small dynamic scripting language –Easy dynamic-code-creating programs –Fast edit-run-expletive cycle –Can be fast with a tracing JIT compiler Surprisingly good fit for functional simulation Loose typing good for heterogenous devices Conventional coding, not exotic ISS DSL Small enough and fast enough to go meta

23 23 Meta-Meta http://xkcd.com/917/

24 24 end --Thank You! Craig Steele, steele@exogi.com

25 25 Caching in Lua Lua has one primitive composite data type Table is associative array –Almost any data type can be used as index –Integers are optimized like C arrays –Other index types are used for hashtable –Tables are both arrays and hashtables Meta-tables allow custom index operations Using one table to cache another is very easy

26 26 A Use Case of Caching in Lua Instruction-memory decoding can be either –Static: binary available at start, unchanging –Dynamic: binary is JITed or self-modifying Step 0 – Get binary encoded instruction Step 1 – Decode to table, e.g., {"add”,1,0,21} Step 2 – Build instruction as stack of Lua fcns Step 3 – Link instructions into basic blocks Step 4 – Locate BBs at absolute addresses Every step can be cached and “on demand”


Download ppt "Fast Functional Simulation with a Dynamic Language Craig S. Steele, Exogi LLC, USA JP Bonn, Exogi LLC, USA."

Similar presentations


Ads by Google