Kris Lange Nopparat suwaanarat Pree Thiengburanathum.


1 Kris Lange Nopparat suwaanarat Pree Thiengburanathum

2  Introduction  Motivation  Review concepts  M5 architecture  Configuring M5 Simulator  Simulation  Results and Analysis  Conclusion

3  Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"  Paper makes 2 claims ◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size) ◦ Benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies

4  Gain a deeper understanding of the research paper  Verify the results of this paper  Gain hands-on experience running a peer-reviewed experiment

5  Heterogeneous CMP system  Homogeneous CMP system  Heterogeneous vs. homogeneous in multi-programmed workloads

6  Heterogeneous CMP system ◦ Many simple cores = higher thread parallelism ◦ Fewer, larger cores = lower thread parallelism  We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.  How? By mapping running tasks to cores and using a control mechanism.

7 Which one has a better total execution time?
Control mechanism: thread assignment policies
Static thread assignment: random, best
Dynamic thread assignment: round robin, IPC driven

           P1    P2
Thread A   1.6   0.4
Thread B   1.5   1
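To make the question on this slide concrete, the sketch below enumerates both one-to-one assignments of threads A and B to cores P1 and P2 and compares them. It treats the numbers listed for each thread and core as IPC values and uses summed IPC as a stand-in for throughput; both of these, and the function name `best_assignment`, are assumptions of this sketch rather than something the slide states.

```python
from itertools import permutations

# Per-thread values on each core, copied from the slide (assumed to be IPC).
IPC = {
    "A": {"P1": 1.6, "P2": 0.4},
    "B": {"P1": 1.5, "P2": 1.0},
}

def best_assignment(ipc):
    """Return (mapping, total) for the one-to-one mapping with the highest summed IPC."""
    threads = sorted(ipc)
    cores = sorted(next(iter(ipc.values())))
    best = None
    # Try every way of giving each thread its own core.
    for perm in permutations(cores, len(threads)):
        mapping = dict(zip(threads, perm))
        total = sum(ipc[t][c] for t, c in mapping.items())
        if best is None or total > best[1]:
            best = (mapping, total)
    return best

mapping, total = best_assignment(IPC)
print(mapping, round(total, 2))
```

Under these assumptions, placing A on P1 and B on P2 gives a summed IPC of 2.6, against 1.9 for the opposite placement.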

8 Static thread assignment  Usually assigns threads to the faster core.  A well-studied problem.  Solutions rely on heuristics: ◦ Random static assignment: the workloads and their IPC are not known. ◦ Pseudo-best static assignment: the workloads and their IPC are known, and a heuristic finds the mapping.  Disadvantages: ◦ Threads are not reassigned at run time. ◦ Does not optimize use of the faster core(s). ◦ "Slow" threads on slower core(s) penalize overall system performance.

9  Dynamic thread assignment ◦ Round robin assignment  Rotates the assignment of threads to processors in round-robin fashion.  Ensures that the available faster cores are shared equally among the running programs.

10  IPC-driven assignment ◦ Considers the characteristics of the executing threads. ◦ Looks at each thread's IPC and the ratio of its IPC between the two core types to decide the thread mapping. ◦ Threads with a higher ratio run on the faster core. ◦ Threads with a lower ratio run on the slower core.
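The ratio rule on this slide can be sketched as follows. This is a minimal illustration, not the paper's or the authors' implementation: the function name `ipc_driven_mapping`, the thread names, and the IPC numbers are all hypothetical.

```python
def ipc_driven_mapping(ipc_fast, ipc_slow, n_fast):
    """Place the n_fast threads with the highest fast/slow IPC ratio on fast cores."""
    ratio = {t: ipc_fast[t] / ipc_slow[t] for t in ipc_fast}
    # Threads that gain the most from the fast core come first.
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return {t: ("fast" if i < n_fast else "slow") for i, t in enumerate(ranked)}

# Hypothetical per-thread IPC on each core type (illustrative, not measured).
ipc_fast = {"t0": 1.6, "t1": 1.5, "t2": 0.9}
ipc_slow = {"t0": 0.4, "t1": 1.0, "t2": 0.8}

placement = ipc_driven_mapping(ipc_fast, ipc_slow, n_fast=1)
print(placement)
```

Note that `t1` has nearly the same fast-core IPC as `t0`, but it loses little on the slow core, so the single fast core goes to `t0`.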

11  Goal: duplicate experiment in paper (peer-reviewed)  2-phase simulation ◦ 1) Obtain IPC trace values for Spec2000 programs  Using M5 simulator  Alpha EV5 + EV6 cores ◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

12  Rsim  Simple MP  SimOS  Simics  TFsim  SimFlex  GEMS

13  What is M5 ?  A brief peek inside

14  A modular platform for simulating systems  Encompasses ◦ system-level architecture ◦ processor microarchitecture

15  Pervasively Object-oriented  Multiple interchangeable CPU models  Event-driven memory system  Multiprocessor / multi-system capability

16 [Figure: M5 system diagram showing a CPU, L1 caches, L2 caches, buses, bus bridges, memory, and an I/O device]

17  CPU Models  ISA  Memory System  Cache  Buses

18  One simple CPU model  Two detailed CPU models

19 [Figure: detailed CPU pipeline with backward communication between stages: Fetch, Decode, Rename, Issue/Execute/Writeback, Commit]

20  Goal ◦ Allow a human-readable ISA description  Two parts ◦ A simple part: describes the decode ◦ A declaration part: describes the global information

21  Goal  combine the timing and functional models into one model  Simplify the memory system code  Make changes easier

22 [Figure: memory objects (caches, bus, memory) connected through ports, each port paired with a peer]

23  Coherency  Prefetching ◦ BasePrefetcher ◦ GHBPrefetcher ◦ StridePrefetcher ◦ TaggedPrefetcher

24  memory, I/O, CPUs  Master- closer to memory  Slave- closer to CPU

25  Setup for M5 Simulator ◦ Windows Vista host running Fedora Core in VMware.  Download the simulator from the website. ◦ www.m5sim.org (open source)  Required software: ◦ g++, python, scons, zlib, swig

26  FS mode ◦ Full System mode. This mode simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture.  SE mode ◦ Syscall Emulation mode. This mode simulates statically compiled binaries by functionally emulating any syscalls they make.  Example commands to build and run M5 ◦ % scons build/ALPHA_SE/m5.debug ◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py

27  What is cross compilation? ◦ Compiling a program for a target platform different from the platform the compiler runs on  M5 test programs must be compiled for Alpha + Linux  Why? ◦ M5 implements the Alpha ISA and Linux syscalls  Since we don't own Alpha hardware: cross-compile

28  The build toolchain must be built for the specific target ◦ gcc, glibc, binutils, etc.  Dan Kegel's crosstool makes this easier: ◦ http://www.kegel.com/crosstool  Of the 3 Spec2000 programs we considered, we were only able to successfully cross-compile gzip

29  Scour the net until you run across this link: ◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2 ◦ All Spec2000 binaries compiled for alpha-linux!

30 M5 produces simulation results at end:
---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
host_mem_usage 543680 # Number of bytes of host memory used
host_seconds 0.07 # Real time elapsed on the host
host_tick_rate 28827895 # Simulator tick rate (ticks/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
system.cpu0.dtb.accesses 0 # DTB accesses
system.cpu0.dtb.acv 0 # DTB access violations
system.cpu0.dtb.hits 0 # DTB hits
system.cpu2.num_refs 1960 # Number of memory references
...
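A small parser can turn this statistics dump into a dictionary for later analysis. The sketch below assumes each statistic sits on its own line in the `name value # description` form seen in the output; the function name `parse_m5_stats` is hypothetical and is not part of M5.

```python
def parse_m5_stats(text):
    """Parse 'name value # description' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        # Drop the trailing description and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and the dashed banner lines.
        if not line or line.startswith("-"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            stats[parts[0]] = float(parts[1])
        except ValueError:
            pass  # Ignore values that are not plain numbers.
    return stats

sample = """\
---------- Begin Simulation Statistics ----------
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
"""
stats = parse_m5_stats(sample)
print(stats["sim_insts"], stats["sim_ticks"])
```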

31  We want IPC trace every 1 million cycles  So we patched:
diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
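The patch prints one `IPC: totalCycles,totalCommittedInsts,currentIpc` line per million cycles, so the trace can be recovered from the simulator's standard output with a short filter. This is a sketch under the assumption that the first two fields print as integers; the function name `parse_ipc_trace` and the sample log lines are hypothetical.

```python
def parse_ipc_trace(lines):
    """Collect (total_cycles, total_insts, interval_ipc) tuples from 'IPC: a,b,c' lines."""
    samples = []
    for line in lines:
        # Other simulator output is interleaved with the trace; skip it.
        if not line.startswith("IPC: "):
            continue
        cycles, insts, ipc = line[len("IPC: "):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Hypothetical simulator output (values are illustrative, not measured).
log = [
    "warn: some unrelated simulator message",
    "IPC: 1000000,812345,0.812345",
    "IPC: 2000000,1500000,0.687655",
]
samples = parse_ipc_trace(log)
print(samples)
```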

32

33

34

35  Goal: duplicate experiment in paper (peer-reviewed)  2-phase simulation ◦ 1) Obtain IPC trace values for Spec2000 programs  Using M5 simulator  Alpha EV5 + EV6 cores ◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

36  Spec 2000  Paper: ◦ gzip ◦ gcc ◦ crafty (chess program) ◦ parser (natural language processor) ◦ bzip2 ◦ wupwise (quantum chromodynamics) ◦ swim (shallow water modeling) ◦ mgrid (multi-grid solver in 3D potential field) ◦ galgel (fluid dynamics modeling) ◦ equake (earthquake modeling) ◦ lucas (prime number test)  Us: ◦ gzip ◦ bzip2 ◦ crafty

37  Spec 2000 input is proprietary  Compromise: ◦ gzip/bzip2 input: Shakespeare plays ◦ crafty input: sample chess game

38  Obtained from M5

39

40

41  Java  Modular design  Core simulator module  Common thread-assignment policy interface  Policy modules ◦ Static ◦ Round Robin (dynamic) ◦ IPC-Driven (dynamic)

42  Command-line interface ◦ Example: CMPSim spec2000 10 2 1 roundrobin  Input: ◦ Workload ◦ Number of threads  Selected randomly from 3 Spec 2000 programs ◦ # EV5 cores ◦ # EV6 cores ◦ Thread assignment policy

43  Output:
Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091

44  IPC data are temporal sequences

45  Randomly assign threads to cores at startup  Repeat process whenever core becomes idle  Weaknesses: ◦ When one core becomes idle, it will persist in that state unless some unassigned thread exists. ◦ In the case of a heterogeneous system, this results in underutilization of "faster" cores. ◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.

46  Randomly assign threads to cores at startup  Define swap_period  Experimentally, swap_period = 20M cycles works well  if (current_cycle % swap_period == 0) ◦ Migrate thread from EV6 -> wait queue ◦ Migrate thread from EV5 -> EV6 ◦ Migrate thread from wait queue -> EV6  When a core becomes idle, assign the longest-waiting thread
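One simplified way to picture the round-robin policy is as a ring of threads that shifts by one slot at every swap-period boundary. The sketch below is an illustration only, not the authors' Java simulator: the core and thread names are hypothetical, cores are listed fastest first, and the thread leaving the wait queue is assumed to take the EV5 slot that was just vacated, which is one reading of the last migration step.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles; the value quoted on the slide

def rotate_if_due(current_cycle, threads):
    """Shift every thread one slot toward the front on a swap-period boundary."""
    if current_cycle > 0 and current_cycle % SWAP_PERIOD == 0:
        threads.rotate(-1)  # The front thread (on the EV6) moves to the back of the ring.

def placement(threads, cores):
    """Pair threads with cores in order; threads beyond the core count wait."""
    running = dict(zip(cores, threads))
    waiting = list(threads)[len(cores):]
    return running, waiting

cores = ["EV6_0", "EV5_0", "EV5_1"]        # Fastest core first (hypothetical names).
threads = deque(["t0", "t1", "t2", "t3"])  # t3 starts in the wait queue.

rotate_if_due(20_000_000, threads)
running, waiting = placement(threads, cores)
print(running, waiting)
```

After one swap, `t1` has moved from an EV5 to the EV6, `t0` has left the EV6 for the wait queue, and `t3` has left the wait queue, so over time every thread gets an equal share of the fast core.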

47  Costs ◦ Inter-core context switch  PC, registers, etc must be transferred ◦ Cache warmup  Simple model ◦ switch_loss: 50% ◦ switch_duration: 1M cycles
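The simple cost model can be read as follows: for `switch_duration` cycles after a migration, a thread retires instructions at a rate reduced by `switch_loss`. That reading is an assumption of this sketch, since the slide does not spell out how the two parameters are applied, and the function name is hypothetical.

```python
SWITCH_LOSS = 0.5            # Fraction of IPC lost while warming up (from the slide).
SWITCH_DURATION = 1_000_000  # Cycles the penalty lasts (from the slide).

def insts_after_switch(base_ipc, period):
    """Instructions retired in `period` cycles that begin with a migration."""
    penalized = min(period, SWITCH_DURATION)
    slow_part = base_ipc * (1.0 - SWITCH_LOSS) * penalized
    full_part = base_ipc * (period - penalized)
    return slow_part + full_part

# With a 20M-cycle swap period, the penalty costs 0.5M instructions per swap at IPC 1.0.
retired = insts_after_switch(base_ipc=1.0, period=20_000_000)
print(retired)
```

Under this reading, a swap every 20M cycles costs about 2.5% of throughput for a thread with a steady IPC, which gives a sense of why a long swap period keeps migration overhead small.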

48  No effort is made to optimize thread-to-core mapping

49  Optimize thread-to-core mapping  Define IPC ratio = EV6 IPC / EV5 IPC  Heuristic: threads with the highest IPC ratio are assigned to EV6  System must compute average IPC for each core type  Requires forced migrations  To handle IPC spikes, use a weighted average: ◦ Current IPC * 0.65 + Previous IPC * 0.35
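The weighted average that damps IPC spikes can be written directly from the formula on this slide. In the sketch below, `smooth` feeds each weighted value back in as the next "previous IPC"; whether the authors used the previous raw sample or the previous weighted value is not stated, so that choice is an assumption, as are the function names and the sample trace.

```python
CURRENT_WEIGHT = 0.65   # From the slide.
PREVIOUS_WEIGHT = 0.35  # From the slide.

def weighted_ipc(current, previous):
    """Blend the newest IPC sample with the previous one."""
    return CURRENT_WEIGHT * current + PREVIOUS_WEIGHT * previous

def smooth(trace):
    """Apply the weighted average along a trace, seeding with the first sample."""
    out = []
    previous = trace[0]
    for sample in trace:
        previous = weighted_ipc(sample, previous)
        out.append(previous)
    return out

# Hypothetical IPC trace with a spike in the last interval.
smoothed = smooth([1.0, 1.0, 3.0])
print(smoothed)
```

The spike from 1.0 to 3.0 shows up as roughly 2.3 in the smoothed trace, so a single unusual interval moves the ratio less than the raw sample would.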

50  Randomly assign threads to cores at startup  Again, define swap_period  Experimentally, swap_period = 20M cycles works well  if (current_cycle % swap_period == 0) ◦ Sort threads by weighted IPC ratio ◦ Migrate accordingly  When core becomes idle, assign thread from wait queue with highest IPC ratio
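The idle-core rule on this slide amounts to picking the waiting thread with the largest weighted IPC ratio. A minimal sketch, with a hypothetical function name and illustrative ratios:

```python
def pick_for_idle_core(wait_queue, ipc_ratio):
    """Remove and return the waiting thread with the highest weighted IPC ratio."""
    if not wait_queue:
        return None  # Nothing is waiting, so the core stays idle.
    chosen = max(wait_queue, key=lambda t: ipc_ratio[t])
    wait_queue.remove(chosen)
    return chosen

# Hypothetical weighted EV6/EV5 IPC ratios for three waiting threads.
wait_queue = ["t0", "t1", "t2"]
ipc_ratio = {"t0": 1.2, "t1": 2.5, "t2": 1.9}

chosen = pick_for_idle_core(wait_queue, ipc_ratio)
print(chosen, wait_queue)
```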

51

52  Goal: verify results of paper  Repeat their experiments

53  Policy Comparison ◦ Static vs Round Robin vs IPC-Driven ◦ Heterogeneous system: 5 x EV5, 3 x EV6

54

55

56  Heterogeneous vs. homogeneous system  Let 1 EV6 = 5 EV5 ◦ Based on die areas  Configurations ◦ 20 EV5 ◦ 10 EV5, 2 EV6 ◦ 5 EV5, 3 EV6 ◦ 4 EV6
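A quick check confirms that the four configurations occupy the same die area under the 1 EV6 = 5 EV5 equivalence, which is what makes the comparison fair. The sketch measures area in EV5-sized units; the function and constant names are hypothetical.

```python
EV6_AREA_IN_EV5 = 5  # One EV6 occupies the area of five EV5 cores (from the slide).

# (number of EV5 cores, number of EV6 cores) for each configuration on the slide.
CONFIGS = [(20, 0), (10, 2), (5, 3), (0, 4)]

def die_area(n_ev5, n_ev6):
    """Total area in EV5-sized units."""
    return n_ev5 + EV6_AREA_IN_EV5 * n_ev6

areas = [die_area(n_ev5, n_ev6) for n_ev5, n_ev6 in CONFIGS]
print(areas)
```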

57

58

59  Simulator neglects L2 cache contention!  Simplified thread migration model  Only used 3 Spec 2000 programs ◦ Paper used 11  Didn't have access to Spec 2000 inputs  Our EV5 and EV6 configurations were not perfect ◦ Lack of M5 documentation made this difficult

60  Google Code ◦ Source Control ◦ Wiki

61  Confirmed that dynamic thread assignment outperforms static thread assignment  Unable to confirm that heterogeneous outperforms homogeneous ◦ Limitations of minimal Spec 2000 workload  Learned how to design a complex, peer-reviewed experiment

62

