Many-SC Project Runtime Environment (RTE) CSAP Lab 2014/10/28.


Slide 1: Many-SC Project Runtime Environment (RTE), CSAP Lab, 2014/10/28

Slide 2: 1st Year Progress Overview
(Timeline chart; month ticks 12, 1–11.)
- Literature survey
  - many-core OSes: Barrelfish, Corey, Exokernel, Harmony, Tessellation, fOS
  - DVFS for multi/many-core architectures
- Random diagnostics generator
  - implementation of the WE signal
  - implementation of predicate routing
  - testing (CSIM done, binary/RTL under way)
- Many-SC RTE design
  - services and functions provided
  - resource management (cores, memory, power, ...)
  - communication
  - requirements: H/W and application level (OpenCL)
- Many-SC RTE implementation
  - on existing H/W (Tilera, SCC)
  - on the simulator (→ simulator team)
- Many-SC power mgmt
  - DVFS on many-core architectures (voltage/frequency-island aware)
  - combining DVFS with OS/process migration
  - implemented on real H/W (Intel SCC)

Slide 3: The Many-SC RTE Framework Overview
(Block diagram: the RTE prototype runs on a host CPU/host machine; its static scheduler takes a many-core architecture description, a list of target applications, and application profiles from the offline and online profilers, and produces a scheduling result, e.g. App1 → tile, App2 → tile, for the many-core H/W.)
- Many-core H/W description (e.g., Many-SC, Tilera, AMD)
  - processor description: cluster = a set of tiles; tile = a set of cores; core = a computing unit; NoC = interconnection network
  - memory description: one entry per memory controller
  (a data-structure sketch of this description follows after this slide)
- Application profiles
  - minimal requirements (cores, memory)
  - performance profiles
  - memory access latencies
- Target application list
  - applications are given
  - application categorization (e.g., the seven dwarfs)
- Work progress (~2014/10)
  - architecture/application description
  - scheduling algorithms
  - offline profiling tools
  - the prototype interacts with Tilera; further improvements needed
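A minimal sketch of how this architecture description might be represented inside the scheduler, following the cluster/tile/core hierarchy above. All type and field names here are illustrative assumptions; the slides only name the hierarchy itself:

    #include <string>
    #include <vector>

    // Hypothetical in-memory form of the many-core H/W description:
    // core = computing unit, tile = set of cores, cluster = set of tiles,
    // plus one entry per memory controller.
    struct Core             { int id; };
    struct Tile             { int id;
                              int x, y;               // position on the NoC mesh
                              std::vector<Core> cores; };
    struct Cluster          { int id; std::vector<Tile> tiles; };
    struct MemoryController { int id; int x, y; int size; };

    struct Architecture {
        std::string name;       // e.g., "many-sc"
        std::string topology;   // e.g., "mesh"
        std::vector<Cluster> clusters;           // 4 in the Many-SC prototype
        std::vector<MemoryController> memories;  // 1 in the Many-SC prototype
    };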

Slide 4: The Static Scheduler
Scheduling assumptions:
- an application is restricted to one cluster
- an application's address space is fixed to one memory controller
- no support for memory-contention modeling (at the moment)
Inter-cluster scheduling: cluster (and memory controller) packing
- round robin; the current scheduler focuses on intra-cluster scheduling (a code sketch of the round-robin packing follows after this slide)
- a more elaborate cluster assignment is planned for 2014/11
(Figure: Applications 1–5 assigned to Clusters 1–4.)
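A minimal sketch of the round-robin cluster packing described above, under the stated assumption that each application is pinned to a single cluster. The function and field names are assumptions, not taken from the actual prototype:

    #include <cstddef>
    #include <vector>

    struct Cluster { int id; };
    struct App     { int id; Cluster* cluster = nullptr; };

    // Round-robin inter-cluster packing: each application is assigned to
    // exactly one cluster (and, implicitly, that cluster's memory controller).
    void PackClustersRoundRobin(std::vector<App>& apps,
                                const std::vector<Cluster*>& clusters) {
        std::size_t next = 0;
        for (App& app : apps) {
            app.cluster = clusters[next];
            next = (next + 1) % clusters.size();  // Application 5 wraps to Cluster 1
        }
    }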

Slide 5: Intra-cluster Scheduling Algorithms
- Fairly dividing scheduler: divides the cluster's core/memory resources evenly among the running applications.
- Brute-force greedy scheduler: allocates each core to the application whose performance improves the most when given that core.
- Hybrid scheduler: combines the fairly dividing allocator with brute-force scheduling; starting from the fairly divided allocation, adjacent applications exchange resources over iterations in a simulated-annealing-like heuristic.
(Pseudocode for all three algorithms is given in the backup slides 16–18.)

Slide 6: Scheduling Scenarios
Target applications (OpenMP), with their seven-dwarfs categories:
- matrix multiplication (dense/sparse linear algebra)
- FFT (spectral methods)
- molecular dynamics (N-body methods)
- image blurring (structured grids)
- Monte Carlo approximation (map-reduce)
Performance profiles come from the offline profiler (chart omitted; a representation sketch follows after this slide).
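A minimal sketch of how such an offline performance profile might be stored and queried by the scheduler. The type, the field names, and the nearest-point lookup policy are assumptions for illustration; the tile and memory bounds mirror the example profile on slide 15:

    #include <iterator>
    #include <map>
    #include <string>

    // Hypothetical application profile: minimal/maximal resource
    // requirements plus measured performance per tile count.
    struct AppProfile {
        std::string name;             // e.g., "matrix"
        int minTiles  = 1,   maxTiles  = 12;    // as in the slide-15 example
        int minMemory = 100, maxMemory = 1000;  // units as in the profile file
        std::map<int, double> perf;   // tiles -> measured performance

        // Performance estimate for a given tile count: the nearest
        // measured point at or below it (interpolation policy is a guess).
        double PerfAt(int tiles) const {
            auto it = perf.upper_bound(tiles);   // first entry with key > tiles
            if (it == perf.begin()) return 0.0;  // below the smallest measurement
            return std::prev(it)->second;
        }
    };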

Slide 7: A Scheduling Result (cont'd)
(Figures: tile maps from the fairly dividing, brute-force greedy, and hybrid schedulers for matrix, FFT, and blur, with benchmark results on Tilera.)

Slide 8: A Scheduling Result (cont'd)
(Figures: the same comparison for matrix, FFT, blur, molecular dynamics, and Monte Carlo, with benchmark results on Tilera.)

Slide 9: A Scheduling Result (cont'd)
(Figures: the same comparison for matrix 1, matrix 2, molecular dynamics, and Monte Carlo, with benchmark results on Tilera.)

Slide 10: Scheduler Evaluation
For scheduling parallel applications on many cores, space-sharing scheduling gives better results than Linux's time-sharing scheduling.
Further improvements needed:
- application-aware profiling
  - the current application performance model is based only on the number of tiles and the average memory access latencies, i.e., Manhattan distances (a sketch of this model follows after this slide)
  - need to consider tile-interconnection patterns and the routing network
  - need to consider memory-contention-based scheduling
- dynamic resource management
  - dynamic core allocation during an application's lifetime, when other applications finish
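A minimal sketch of the performance model as stated above: profiled performance for the allocated tile count, discounted by the average Manhattan distance from the tiles to the application's memory controller. The slides only name the two model inputs; the helper names and the exact discount form are assumptions:

    #include <cstdlib>
    #include <vector>

    struct Tile             { int x, y; };
    struct MemoryController { int x, y; };

    // NoC hop-count proxy for memory access latency.
    int ManhattanDistance(const Tile& t, const MemoryController& mc) {
        return std::abs(t.x - mc.x) + std::abs(t.y - mc.y);
    }

    // Average memory access latency of an allocation: mean Manhattan
    // distance from the app's tiles to its (single) memory controller.
    double AvgMemoryDistance(const std::vector<Tile>& tiles,
                             const MemoryController& mc) {
        if (tiles.empty()) return 0.0;
        double sum = 0.0;
        for (const Tile& t : tiles) sum += ManhattanDistance(t, mc);
        return sum / tiles.size();
    }

    // Model score for an allocation: profiled performance at this tile
    // count, discounted by distance (the discount shape is a guess).
    double ModelScore(double perfAtTileCount, double avgDistance) {
        return perfAtTileCount / (1.0 + avgDistance);
    }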

Slide 11: Conclusion and Outlook
Current status (2014/10):
- static scheduling framework, including architecture & application descriptions
- several static scheduling algorithms implemented and under test
  - greedy and heuristic algorithms
  - target applications categorized according to the seven dwarfs
Static scheduler (2014/11; 1st year):
- comparison of the heuristic schedulers against an optimal scheduler
- inter-cluster scheduling
- consideration of memory-contention-based scheduling
Dynamic scheduler (2nd year):
- dynamic resource allocation algorithms and policies
- interaction with the application runtime (dynamic resource mgmt.)
- move the RTE framework onto the Many-SC simulator

Slide 12: Thank you. Questions?

Slide 13: Backup Slides

Slide 14: Architecture Description Example: Many-SC prototype

    <architecture name="many-sc" topology="mesh" clusters="4" tiles="48"
                  cores="96" memories="1" memsize="4096">
      ……
      ……
      ……
    </architecture>
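The element bodies are elided in the slide. Purely as an illustration of how the cluster → tile → core hierarchy from slide 3 might be spelled out, a hypothetical expansion; none of the child element or attribute names below are confirmed by the slides:

    <architecture name="many-sc" topology="mesh" clusters="4" tiles="48"
                  cores="96" memories="1" memsize="4096">
      <!-- hypothetical children; the real schema is elided in the slide -->
      <cluster id="0">
        <tile id="0" x="0" y="0">
          <core id="0"/>
          <core id="1"/>
        </tile>
        <!-- ... remaining tiles of cluster 0 ... -->
      </cluster>
      <!-- ... clusters 1-3 ... -->
      <memory id="0" size="4096"/>
    </architecture>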

Slide 15: Application Profiling Example

    <specification architecture="many-sc" mintiles="1" maxtiles="12"
                   minmemory="100" maxmemory="1000" path="./apps/matrix">
      ……
      ……
      ……
      …
    </specification>

Slide 16: Scheduling Algorithms (cont'd)
Fairly dividing allocator:

    // fairly divide the cluster's tiles among the running applications
    foreach (app)
        app->reserved = fairly divided tile number;

    // start from the application with the maximum memory controller
    // proximity benefit (m_prior)
    foreach (app) {
        tile = cluster->GetMemoryClosestIdleTile(app->GetMemoryController());
        while (app->allocated < app->reserved) {
            app->allocate(tile);
            // find the next tile while keeping the allocation clustered;
            // considers colored flags, degrees, and depths
            tile = FindNextTile(app->tilePool, CLUSTERING);
        }
    }

Slide 17: Scheduling Algorithms (cont'd)
Brute force greedy scheduler:

    // start from the tile that has the maximum memory controller
    // proximity benefit
    while (idle core exists) {
        tile = cluster->GetMemoryClosestIdleTile();
        // pick one app in a greedy way: the one with the maximum
        // performance score-up with this tile
        app = PeekAppThatHasMaximumScoreUpWith(tile);
        app->allocate(tile);
    }

Slide 18: Scheduling Algorithms (cont'd)
Hybrid scheduler: combines the fairly dividing allocator with brute-force scheduling.

    FairDivideScheduler();  // run algorithm 1 (slide 16) first
    do {
        // compute the tile-exchange performance benefit for each
        // pair of adjacent applications
        map<double, pair<App*, App*>> pairs;  // benefit -> application pair
        pairs = ComputeTileExchangeBenefits(allApps);

        // select the pair of adjacent applications with the maximum benefit
        pair<App*, App*> selectedPair = pairs.get(maxBenefit);
        badApp  = selectedPair->first;
        goodApp = selectedPair->second;

        // reallocate one tile from badApp to goodApp
        ExchangeOneTile(badApp, goodApp);
    } while (benefit is enough);

Slide 19: (Figure: two views of Applications 1–5 mapped onto Clusters 1–4.)

Slide 20: A Scheduling Result (cont'd)
(Figure: tile maps for matrix, FFT, and blur under the fairly dividing, brute-force greedy, and hybrid schedulers.)

Slide 21: A Scheduling Result (cont'd)
(Figure: another tile-map comparison for matrix, FFT, and blur under the three schedulers.)

Slide 22: A Scheduling Result (cont'd)
(Figures: scheduling and Tilera benchmark results for matrix, FFT, blur, molecular dynamics, and Monte Carlo; same content as slide 8.)

Slide 23: A Scheduling Result (cont'd)
(Figures: scheduling and Tilera benchmark results for matrix 1, matrix 2, molecular dynamics, and Monte Carlo; same content as slide 9.)

Slide 24: A Scheduling Result (cont'd)
(Figures: scheduling and Tilera benchmark results for apps 1–4 under the brute-force greedy and hybrid schedulers.)

Slide 25: A Scheduling Result (cont'd)
(Figures: scheduling and Tilera benchmark results for matrix, FFT, blur, molecular dynamics, and Monte Carlo under all three schedulers.)

Slide 26: A Scheduling Result (cont'd)
(Figures: scheduling and Tilera benchmark results for matrix 1, matrix 2, molecular dynamics, and Monte Carlo, with per-app results for apps 1–4 under all three schedulers.)

