1 of 14 Lab 2: Design-Space Exploration with MPARM.

1 of 14 Lab 2: Design-Space Exploration with MPARM

2 of 14 2 Outline  The MPARM simulation framework  Hardware platform  Software platform  Design-space exploration for energy minimization  Communication in MPSoC  Mapping and scheduling

3 of 14 3 MPSoC Architecture Bus ARM Interrupt Device Private Memory Private Memory Private Memory Semaphore Device Shared Memory CACHE

4 of 14 4 System Design Flow Informal specification, constraints Modeling System model Mapped and scheduled model Estimation System architecture Architecture selection Hardware and Software Implementation Prototype Fabrication ok Testing Functional simulation not ok Mapping Scheduling not ok

5 of 14 5 System Design Flow Hardware platform Software Application(s) Extract Task Graph Extract Task Parameters Optimize (Mapping & Sched) Implement Formal Simulation

6 of 14 6 MPARM: Hardware  Up to eight ARM7 processors  Variable frequency (dynamic and static)  Instruction and data caches  Private memory  Scratchpad  Shared memory  Communication bus  Read more in:  /home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt

7 of 14 7 MPARM: Software  C crosscompiler toolchain for building software applications  No operating system  Small set of functions (such as WAIT and SIGNAL)  (look in the application code)

8 of 14 8 MPARM: Why?  Cycle-accurate simulation of the system  Various statistics: number of clock cycles executed, bus utilization, cache efficiency, and energy/power consumption of the components (CPUs, buses, and memories)

9 of 14 9 MPARM: How?  mpsim.x -c2 — run on two processors, collecting default statistics  mpsim.x -c2 -w — run on two processors, collecting power/energy statistics  mpsim.x -c1 --is=9 --ds=10 — run on one processor with instruction cache of 512 bytes and data cache of 1024 bytes  mpsim.x -c2 -F0,2 -F1,1 -F3,3 — run on two processors operating at 100 MHz and 200 MHz and the bus operating at 66 MHz  200 MHz is the ”default” frequency  mpsim.x -h — show other options  Simulation results are in the file stats.txt

10 of 14 10 Design-Space Exploration  Platform optimization  Select the number of processors  Select the speed of each processor  Select the type, associativity, and size of the cache  Select the bus type  Application optimization  Select the interprocessor communication style (shared memory or distributed message passing)  Select the best mapping and schedule

11 of 14 11  Given a GSM codec  Running on one ARM7 processor  Variables are:  cache parameters  processor frequency  Using MPARM, find a hardware configuration that minimizes the energy of the system Assignment 1

12 of 14 12 Energy/Speed Tradeoff CPU model

13 of 14 13 Frequency Selection: ARM Core Energy 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 1 1.5 2 2.5 3 3.5 4 Energy [mJ] Freq. divider

14 of 14 14 Frequency Selection: Total Energy 8.5 9.0 9.5 10 10.5 11 1 1.5 2 2.5 3 3.5 4 Energy [mJ] Freq. divider

15 of 14 15 Instruction Cache Size: Total Energy log2(CacheSize) 8.50 9.00 9.50 10.0 10.5 11.0 11.5 12.0 12.5 9 10 11 12 13 14 Energy [mJ] 2 9 =512 bytes 2 14 =16 kbytes

16 of 14 16 Instruction Cache Size: Execution Time t [cycles] 5e+07 5.5e+07 6e+07 6.5e+07 7e+07 7.5e+07 8e+07 8.5e+07 9e+07 9.5e+07 1e+08 9 10 11 12 13 14 log2(CacheSize)

17 of 14 17 Interprocessor Data Communication CPU 1 CPU 2... a =1... print a ;... BUS How?

18 of 14 18 Shared Memory CPU 1 CPU 2... a =1... print a ; BUS Shared Mem a a =2 a=? Synchronization

19 of 14 19 Synchronization CPU 1 CPU 2 a =1 signal(sem_a) print a ; Shared Mem a a =2 Semaphore sem_a wait(sem_a) a =2 With semaphores BUS

20 of 14 20 Synchronization Internals (1) CPU 1 CPU 2 a =1 signal(sem_a) print a ; Shared Mem a a =2 Semaphore sem_a while (sem_a==0) wait(sem_a) polling sem_a=1 BUS

21 of 14 21 Synchronization Internals (2)  Disadvantages of polling  Results in higher power consumption (energy from the battery)  Larger execution time of the application  Blocking important communication on the bus

22 of 14 22 Distributed Message Passing  Instead  Direct CPU-CPU communication with distributed semaphores  Each CPU has its own scratchpad  Smaller and faster than a RAM  Smaller energy consumption than a cache  Put frequently used variables on the scratchpad  Cache controlled by hardware (cache lines, hits/misses, …)  Scratchpad controlled by software (e.g., compiler)  Semaphores allocated on scratchpads  No polling

23 of 14 23 Distributed Message Passing (1) CPU 1 a=1 signal(sem_a) BUS Shared Mem a CPU 2 print a; a=2 wait(sem_a) sem_a

24 of 14 24 Distributed Message Passing (2) CPU 1 (prod) signal(sem_a) BUS CPU 2 (cons) print a; wait(sem_a) sem_a a=1

25 of 14 25 Assignment 2  Given two implementations of the GSM codec  Shared memory  Distributed message passing  Simulate and compare these two approaches  Energy  Runtime

26 of 14 26 Thank you! Questions?

1 of 14 Lab 2: Design-Space Exploration with MPARM.

Similar presentations

Presentation on theme: "1 of 14 Lab 2: Design-Space Exploration with MPARM."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 of 14 Lab 2: Design-Space Exploration with MPARM.

Similar presentations

Presentation on theme: "1 of 14 Lab 2: Design-Space Exploration with MPARM."— Presentation transcript:

Similar presentations

About project

Feedback