Published by Victor Lucas. Modified over 8 years ago.
1
Kris Lange, Nopparat Suwaanarat, Pree Thiengburanathum
2
Introduction
Motivation
Review concepts
M5 architecture
Configuring M5 Simulator
Simulation
Results and Analysis
Conclusion
3
Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"
Paper makes 2 claims
◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size)
◦ Benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies
4
Gain a deeper understanding of the research paper
Verify the results of this paper
Gain hands-on experience running a peer-reviewed experiment
5
Heterogeneous CMP system
Homogeneous CMP system
Heterogeneous vs. homogeneous in multi-programmed workloads
6
Heterogeneous CMP system
Many simple cores = higher thread parallelism
Fewer, larger cores = lower thread parallelism
We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.
How? By mapping running tasks to cores with a control mechanism.
7
Which one has a better total execution time?
Control mechanism: thread assignment policies
◦ Static thread assignment: random, best
◦ Dynamic thread assignment: round robin, IPC driven

          P1    P2
Thread A  1.6   0.4
Thread B  1.5   1
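The comparison this slide asks for can be worked through in a few lines. This is only a sketch: it assumes the table entries are the IPC each thread achieves on each core, and it uses summed IPC as a stand-in for total execution time. The names `ipc` and `system_ipc` are illustrative and do not come from the presentation.

```python
# IPC of each thread on each core, copied from the slide's table
# (assumption: the table values are IPC figures).
ipc = {
    "A": {"P1": 1.6, "P2": 0.4},
    "B": {"P1": 1.5, "P2": 1.0},
}

def system_ipc(assignment):
    """Sum the IPC of every thread on the core it is assigned to."""
    return sum(ipc[thread][core] for thread, core in assignment.items())

# The two possible static assignments of two threads to two cores.
a_on_p1 = system_ipc({"A": "P1", "B": "P2"})
b_on_p1 = system_ipc({"A": "P2", "B": "P1"})

print("A on P1, B on P2:", a_on_p1)
print("B on P1, A on P2:", b_on_p1)
```

Under these assumptions, placing Thread A on P1 gives roughly 2.6 instructions per cycle overall, against roughly 1.9 for the opposite mapping, even though both threads individually run faster on P1.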
8
Static thread assignment
Usually assigns threads to the faster core.
A well-studied problem; threads are assigned before execution starts.
Solutions rely on heuristics:
◦ A random static assignment: workloads and IPC are not known.
◦ A pseudo-best static assignment: workloads and IPC are known, and a heuristic finds the mapping.
Disadvantages:
◦ Threads are not reassigned at run time.
◦ Usage of the faster core(s) is not optimized.
◦ "Slow" threads on slower core(s) penalize overall system performance.
9
Dynamic thread assignment
◦ Round robin assignment
  Rotates the assignment of threads to processors in round-robin fashion.
  Ensures that the available faster cores are shared equally among the running programs.
10
IPC-driven assignment
◦ Considers the characteristics of the executing threads.
◦ Looks at each thread's IPC on the two core types, and the ratio between them, to decide the thread mapping.
◦ Threads with a higher ratio run on the faster core.
◦ Threads with a lower ratio run on the slower core.
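As a rough illustration of the ratio rule, the sketch below ranks threads by fast-core IPC divided by slow-core IPC. It reuses the Thread A and Thread B numbers from slide 7 and assumes P1 is the faster core; `ipc_ratio` and `ranked` are hypothetical names.

```python
def ipc_ratio(fast_ipc, slow_ipc):
    """How much a thread gains from running on the faster core."""
    return fast_ipc / slow_ipc

# (IPC on faster core, IPC on slower core); values taken from slide 7,
# assuming P1 is the faster core.
threads = {"A": (1.6, 0.4), "B": (1.5, 1.0)}

# Threads with the highest ratio come first and get the faster core.
ranked = sorted(threads, key=lambda name: ipc_ratio(*threads[name]), reverse=True)
print(ranked)
```

Thread A speeds up about 4x on the faster core while Thread B speeds up about 1.5x, so A ranks first even though both have similar IPC on the faster core.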
11
Goal: duplicate the experiment in the (peer-reviewed) paper
2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs
  Using the M5 simulator
  Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies
12
Rsim, Simple MP, SimOS, Simics, TFsim, SimFlex, GEMS
13
What is M5 ? A brief peek inside
14
A modular platform for simulating systems
Encompasses
◦ system-level architecture
◦ processor microarchitecture
15
Pervasively object-oriented
Multiple interchangeable CPU models
Event-driven memory system
Multiprocessor / multi-system capability
16
[Figure: M5 block diagram showing CPU, L1 caches, L2 caches, buses, bus bridges, memory, and I/O device]
17
CPU Models
ISA
Memory System
Cache
Buses
18
A simple CPU model
2 detailed CPU models
19
Fetch -> Decode -> Rename -> Issue / Execute / Writeback -> Commit
Backward communication between stages
20
Goal: allow a human-readable ISA description
Two parts
◦ A simple part: describes the decode
◦ A declaration part: describes the global information
21
Goal
◦ Combine the timing and functional models into one model
◦ Simplify the memory system code
◦ Make changes easier
22
[Figure: memory-system port diagram showing cache, bus, and memory objects connected through peer ports]
23
Coherency
Prefetching
◦ BASEPrefetcher
◦ BHB Prefetcher
◦ StridePrefetcher
◦ TaggedPrefetcher
24
Memory, I/O, CPUs
Master: closer to memory
Slave: closer to CPU
25
Setup for M5 Simulator
◦ Windows Vista host running Fedora Core under VMware.
Download the simulator from the website.
◦ www.m5sim.org (open source)
Required software:
◦ g++, python, scons, zlib, swig
26
FS mode
◦ Full System mode. Simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture.
SE mode
◦ Syscall Emulation mode. Simulates statically compiled binaries by functionally emulating any syscalls they make.
Example commands to build and run M5
◦ % scons build/ALPHA_SE/m5.debug
◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py
27
What is cross compilation?
◦ Compiling a program for a target platform different from the platform the compiler runs on
M5 test programs must be compiled for Alpha + Linux
Why?
◦ M5 implements the Alpha ISA and Linux syscalls
Since we don't own Alpha hardware: cross-compile
28
The build toolchain must be built for the specific target
◦ gcc, glibc, binutils, etc.
Dan Kegel's crosstool makes this easier: http://www.kegel.com/crosstool
Of the 3 Spec2000 programs we considered, we were only able to successfully cross-compile gzip
29
Scour the net until you run across this link:
◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2
◦ All Spec2000 binaries compiled for alpha-linux!
30
M5 produces simulation results at end:
---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
host_mem_usage 543680 # Number of bytes of host memory used
host_seconds 0.07 # Real time elapsed on the host
host_tick_rate 28827895 # Simulator tick rate (ticks/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
system.cpu0.dtb.accesses 0 # DTB accesses
system.cpu0.dtb.acv 0 # DTB access violations
system.cpu0.dtb.hits 0 # DTB hits
system.cpu2.num_refs 1960 # Number of memory references
:
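Statistics in this "name value # description" layout are easy to post-process. The sketch below is one way to load them into a dictionary; it assumes exactly that three-part layout, and `parse_stats` is a hypothetical helper, not part of M5.

```python
def parse_stats(text):
    """Parse 'name value # comment' lines into a {name: float} dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) != 2:
            continue  # banner lines, blank lines, anything unexpected
        name, value = fields
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # non-numeric value; ignore in this sketch
    return stats

# A few lines copied from the statistics shown on the slide.
sample = """\
---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_ticks 2005326 # Number of ticks simulated
"""

stats = parse_stats(sample)
simulated_seconds = stats["sim_ticks"] / stats["sim_freq"]
print(stats["sim_insts"], simulated_seconds)
```

Dividing `sim_ticks` by `sim_freq` gives about 2 microseconds of simulated time, which agrees with the `sim_seconds` line in the same dump.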
31
We want an IPC trace every 1 million cycles, so we patched M5:
diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
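The patch prints one line per million cycles in the form "IPC: totalCycles,totalCommittedInsts,currentIpc". A minimal sketch of reading those lines back for the second simulation phase follows; `parse_ipc_trace` is a hypothetical helper and the sample log lines are made up for illustration, not real M5 output.

```python
def parse_ipc_trace(lines):
    """Return (total_cycles, total_committed_insts, interval_ipc) tuples."""
    samples = []
    for line in lines:
        if not line.startswith("IPC: "):
            continue  # skip ordinary simulator output
        cycles, insts, ipc = line[len("IPC: "):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Made-up lines for illustration only.
log = [
    "warn: some unrelated simulator message",
    "IPC: 1000000,850000,0.85",
    "IPC: 2000000,1900000,1.05",
]

trace = parse_ipc_trace(log)
for cycles, insts, ipc in trace:
    print(cycles, insts, ipc)
```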
35
Goal: duplicate the experiment in the (peer-reviewed) paper
2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs
  Using the M5 simulator
  Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies
36
Spec 2000
Paper:
◦ gzip
◦ gcc
◦ crafty (chess program)
◦ parser (natural language processor)
◦ bzip2
◦ wupwise (quantum chromodynamics)
◦ swim (shallow water modeling)
◦ mgrid (multi-grid solver in 3D potential field)
◦ galgel (fluid dynamics modeling)
◦ equake (earthquake modeling)
◦ lucas (prime number test)
Us:
◦ gzip
◦ bzip2
◦ crafty
37
Spec 2000 input is proprietary
Compromise:
◦ gzip/bzip2 input: Shakespeare plays
◦ crafty input: sample chess game
38
Obtained from M5
41
Java
Modular design
◦ Core simulator module
◦ Common thread-assignment policy interface
◦ Policy modules
  Static
  Round Robin (dynamic)
  IPC-Driven (dynamic)
42
Command-line interface
◦ Example: CMPSim spec2000 10 2 1 roundrobin
Input:
◦ Workload
◦ Number of threads
  Selected randomly from 3 Spec 2000 programs
◦ # EV5 cores
◦ # EV6 cores
◦ Thread assignment policy
43
Output:
Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091
44
IPC data are temporal sequences
45
Randomly assign threads to cores at startup
Repeat the process whenever a core becomes idle
Weaknesses:
◦ When a core becomes idle, it stays idle unless some unassigned thread exists.
◦ In a heterogeneous system, this results in underutilization of "faster" cores.
◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.
46
Randomly assign threads to cores at startup
Define swap_period
Experimentally, swap_period = 20M cycles works well
if (current_cycle % swap_period == 0)
◦ Migrate thread from EV6 -> wait queue
◦ Migrate thread from EV5 -> EV6
◦ Migrate thread from wait queue -> EV6
When a core becomes idle, assign the longest-waiting thread
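One possible reading of this rotation, sketched with a single ring of threads: the first entries of the ring occupy the cores (fastest core first) and the rest wait. This is a simplification rather than the presenters' Java code. In particular, the sketch lets the longest-waiting thread fill whichever core is vacated, whereas the slide's last step names EV6. The names `ring`, `swap`, and `placement` are hypothetical.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles, the value the slide reports as working well

def placement(ring, cores):
    """Map cores (fastest first) to the threads at the front of the ring."""
    threads = list(ring)
    running = dict(zip(cores, threads))
    waiting = threads[len(cores):]
    return running, waiting

def swap(ring):
    """Front thread leaves the fastest core and joins the back of the wait queue."""
    ring.rotate(-1)

cores = ["EV6", "EV5"]
ring = deque(["t0", "t1", "t2", "t3"])

before = placement(ring, cores)
for current_cycle in range(1, SWAP_PERIOD + 1, SWAP_PERIOD - 1):
    if current_cycle % SWAP_PERIOD == 0:
        swap(ring)
after = placement(ring, cores)
print(before, after)
```

After one swap period, the thread that was on the EV6 is at the back of the wait queue, the former EV5 thread has moved up to the EV6, and the longest-waiting thread has taken the EV5.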
47
Costs
◦ Inter-core context switch
  PC, registers, etc. must be transferred
◦ Cache warmup
Simple model
◦ switch_loss: 50%
◦ switch_duration: 1M cycles
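The slide gives only the two parameters, so the sketch below assumes one plausible interpretation: for `switch_duration` cycles after a migration, the thread commits instructions at `(1 - switch_loss)` of its normal IPC. The function name and that interpretation are assumptions, not the presenters' implementation.

```python
SWITCH_LOSS = 0.5            # fraction of IPC lost while warming up (from the slide)
SWITCH_DURATION = 1_000_000  # cycles the penalty lasts (from the slide)

def committed_instructions(ipc, cycles, migrated):
    """Instructions committed in an interval, with an optional migration penalty."""
    if not migrated:
        return ipc * cycles
    penalized = min(cycles, SWITCH_DURATION)
    return ipc * (1 - SWITCH_LOSS) * penalized + ipc * (cycles - penalized)

# One 20M-cycle swap period at an IPC of 1.0, with and without a migration.
baseline = committed_instructions(1.0, 20_000_000, migrated=False)
with_switch = committed_instructions(1.0, 20_000_000, migrated=True)
print(baseline, with_switch)
```

Under this reading, a migration once every 20M cycles costs 2.5% of the interval's throughput, which is the overhead a dynamic policy has to win back through better placement.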
48
No effort is made to optimize thread-to-core mapping
49
Optimize thread-to-core mapping
Define IPC ratio = EV6 IPC / EV5 IPC
Heuristic: threads with the highest IPC ratio are assigned to EV6
System must compute the average IPC for each core type
Requires forced migrations
To handle IPC spikes, use a weighted average:
◦ Current IPC * 0.65 + Previous IPC * 0.35
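The weighted average can be written directly from the formula on the slide. What "Previous IPC" refers to is not spelled out; the `smooth` helper below assumes it is the previous weighted value, which makes this an exponential moving average. Both function names are illustrative.

```python
CURRENT_WEIGHT = 0.65   # from the slide
PREVIOUS_WEIGHT = 0.35  # from the slide

def weighted_ipc(current, previous):
    """Current IPC * 0.65 + Previous IPC * 0.35."""
    return current * CURRENT_WEIGHT + previous * PREVIOUS_WEIGHT

def smooth(trace):
    """Apply the weighted average along a non-empty IPC trace.

    Assumption: 'previous' means the previous smoothed value.
    """
    smoothed = [trace[0]]
    for sample in trace[1:]:
        smoothed.append(weighted_ipc(sample, smoothed[-1]))
    return smoothed

# A flat trace with a single spike; the spike is damped but not removed.
spiky = [1.0, 1.0, 3.0, 1.0]
print(smooth(spiky))
```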
50
Randomly assign threads to cores at startup
Again, define swap_period
Experimentally, swap_period = 20M cycles works well
if (current_cycle % swap_period == 0)
◦ Sort threads by weighted IPC ratio
◦ Migrate accordingly
When a core becomes idle, assign the thread from the wait queue with the highest IPC ratio
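The remapping step can be sketched as a sort followed by a split. This is an illustration under stated assumptions, not the presenters' simulator: the ratios are made-up values, and `remap` and `pick_from_wait_queue` are hypothetical names.

```python
def remap(running_ratios, n_ev6):
    """Split running threads into (EV6 threads, EV5 threads) by IPC ratio."""
    ranked = sorted(running_ratios, key=running_ratios.get, reverse=True)
    return ranked[:n_ev6], ranked[n_ev6:]

def pick_from_wait_queue(waiting_ratios):
    """When a core goes idle, take the waiting thread with the highest ratio."""
    return max(waiting_ratios, key=waiting_ratios.get)

# Illustrative weighted IPC ratios (EV6 IPC / EV5 IPC); not measured data.
running = {"t0": 1.8, "t1": 2.4, "t2": 1.2}
waiting = {"t3": 1.5, "t4": 2.0}

on_ev6, on_ev5 = remap(running, n_ev6=1)
next_thread = pick_from_wait_queue(waiting)
print(on_ev6, on_ev5, next_thread)
```

Any thread whose new core type differs from its current one would then be migrated, which is where the forced-migration cost comes in.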
52
Goal: verify results of paper Repeat their experiments
53
Policy comparison
◦ Static vs. Round Robin vs. IPC-Driven
◦ Heterogeneous system: 5 x EV5, 3 x EV6
56
Heterogeneous vs. Homogeneous System
Let 1 EV6 = 5 EV5
◦ Based on die areas
Configurations
◦ 20 EV5
◦ 10 EV5, 2 EV6
◦ 5 EV5, 3 EV6
◦ 4 EV6
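A quick check that the four configurations really occupy the same die area under the slide's rule of 1 EV6 = 5 EV5. The variable names are illustrative.

```python
EV5_PER_EV6 = 5  # die-area equivalence from the slide

# (number of EV5 cores, number of EV6 cores) for each configuration
configs = [(20, 0), (10, 2), (5, 3), (0, 4)]

# Area of each configuration, measured in EV5-sized units.
areas = [ev5 + ev6 * EV5_PER_EV6 for ev5, ev6 in configs]
print(areas)
```

Every configuration comes out to 20 EV5-equivalents, so differences in system IPC reflect the core mix rather than the silicon budget.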
59
Simulator neglects L2 cache contention!
Simplified thread migration model
Only used 3 Spec 2000 programs
◦ Paper used 11
Didn't have access to Spec 2000 inputs
Our EV5 and EV6 configurations were not perfect
◦ Lack of M5 documentation made this difficult
60
Google Code ◦ Source Control ◦ Wiki
61
Confirmed that dynamic thread assignment outperforms static thread assignment
Unable to confirm that heterogeneous outperforms homogeneous
◦ Limitations of minimal Spec 2000 workload
Learned how to design a complex, peer-reviewed experiment