Runtime Support for Distributed Dynamic Locality COLOC / Euro-Par 2017


1 Runtime Support for Distributed Dynamic Locality COLOC / Euro-Par 2017
Tobias Fuchs, Karl Fürlinger Ludwig-Maximilians Universität (LMU) München DASH Project

2 Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees Runtime Support for Dynamic Hardware Locality

3 Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees.
“Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications” (Brice Goglin, 2017)
Approaches to model Xeon Phi KNL in the hwloc topology tree; excellent wrap-up of KNL memory modes.

4 Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees.
“Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications” (Brice Goglin, 2017)
Approaches to model Xeon Phi KNL in the hwloc topology tree; excellent wrap-up of KNL memory modes.
An application assuming a tree topology would discover cores or RAM twice when recursing through the topology structure.

6 Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees.
“Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications” (Brice Goglin, 2017)
Approaches to model Xeon Phi KNL in the hwloc topology tree; excellent wrap-up of KNL memory modes.
Detail on affinity and distance is lost in the conversion to a tree structure; the application cannot specify which structural aspects it needs to be maintained.

7 Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees.
“Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications” (Brice Goglin, 2017)
Approaches to model Xeon Phi KNL in the hwloc topology tree; excellent wrap-up of KNL memory modes.
“This tree representation cannot be a perfect match for non-hierarchical NUMA interconnects but the application may still query the latency matrix to get the exact topology information if needed.”
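The double-discovery problem can be illustrated with a minimal C++ sketch. The `Node` type, the node layout and the core count are hypothetical and are not taken from hwloc or dyloc; the sketch only assumes a core group that is reachable through two memory domains, loosely modelled on the KNL case.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical topology node for illustration; this is not an hwloc type.
struct Node {
  int id;                              // unique node identifier
  int cores;                           // cores attached directly to this node
  std::vector<const Node*> children;   // outgoing edges (acyclic in this sketch)
};

// Naive traversal that assumes a tree: every node is reachable by one path only.
int count_cores_as_tree(const Node& n) {
  assert(n.cores >= 0);
  int total = n.cores;
  for (const Node* c : n.children) total += count_cores_as_tree(*c);
  return total;
}

// Graph-aware traversal: visited nodes are remembered, so shared nodes count once.
int count_cores_as_graph(const Node& n, std::set<int>& seen) {
  if (!seen.insert(n.id).second) return 0;  // already visited via another path
  int total = n.cores;
  for (const Node* c : n.children) total += count_cores_as_graph(*c, seen);
  return total;
}

// Assumed example: one 16-core group reachable through two memory domains.
// Returns {count from tree traversal, count from graph traversal}.
std::pair<int, int> demo_counts() {
  Node cluster{2, 16, {}};
  Node mem_a{1, 0, {&cluster}};
  Node mem_b{3, 0, {&cluster}};
  Node machine{0, 0, {&mem_a, &mem_b}};
  std::set<int> seen;
  return {count_cores_as_tree(machine), count_cores_as_graph(machine, seen)};
}
```

In this assumed layout the tree-style recursion reports twice as many cores as the machine has, while the traversal that tracks visited nodes reports each core once.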

8 Alternatives to the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees … but:
Tree topologies are a commonly understood, widespread mental model.
The majority of existing topology-aware software assumes an “arborescent” memory hierarchy.
A complex graph of the full physical topology detail is unusable as a programming abstraction; code would be graph-of-platform-specific.

9 Making the Tree Model Work
[Diagram labels: Physical, Usable]

10 Making the Tree Model Work
[Diagram labels: Physical, Formal]

11 Making the Tree Model Work
[Diagram labels: Physical, Useful]

12 Making the Tree Model Work
[Diagram labels: Physical, Useful]
Feasible! But it requires you to “ask the right question”.

13 Projecting Topology Graphs to Tree Views
[Diagram labels: hwloc data, dyloc runtime, hwloc topology object; phases: Application Startup Phase, Application Run]
Retrieve detailed hardware information on each machine.

14 Projecting Topology Graphs to Tree Views
[Diagram labels: hwloc data, dyloc runtime, static physical topology graph, hwloc topology object; phases: Application Startup Phase, Application Run]
Retrieve detailed hardware information on each machine.
Combine into a global physical topology graph.

15 Projecting Topology Graphs to Tree Views
[Diagram labels: hwloc data, dyloc runtime, static physical topology graph, tree topology view, hwloc topology object; phases: Application Startup Phase, Application Run]
Retrieve detailed hardware information on each machine.
Combine into a global physical topology graph.
Create light-weight tree views on the topology graph at run-time.
Topology view is materialized as an hwloc topology object.

17 Projecting Topology Graphs to Tree Views
Separate concerns:
hwloc: gather the most detailed system information possible → physical topology, static
dyloc: find a concise representation of the physical topology under given constraints → logical topology, dynamic views
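One simple way to picture a tree view on a topology graph is a breadth-first spanning tree rooted at a chosen node. The following C++ sketch shows only that idea; it is not dyloc's actual projection, and the `Graph` alias, the function name and the sentinel values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical adjacency-list graph: g[u] lists the neighbours of node u.
using Graph = std::vector<std::vector<int>>;

// Projects the graph to a tree view rooted at `root` using breadth-first search.
// Returns a parent array: -1 marks the root, -2 marks nodes outside the view.
// Each node appears exactly once in the resulting tree, even if the graph
// offers several paths to it.
std::vector<int> project_to_tree(const Graph& g, int root) {
  assert(root >= 0 && static_cast<std::size_t>(root) < g.size());
  std::vector<int> parent(g.size(), -2);
  std::queue<int> pending;
  parent[static_cast<std::size_t>(root)] = -1;
  pending.push(root);
  while (!pending.empty()) {
    int u = pending.front();
    pending.pop();
    for (int v : g[static_cast<std::size_t>(u)]) {
      if (parent[static_cast<std::size_t>(v)] == -2) {  // first path wins
        parent[static_cast<std::size_t>(v)] = u;
        pending.push(v);
      }
    }
  }
  return parent;
}
```

Choosing a different root, or ordering neighbours by a distance metric before visiting them, would yield a different view of the same static graph, which is the sense in which such views can be dynamic.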

18 Locality Domain Attributes

20 Locality Domain Attributes
Dynamic capacities depend on the placement of units and their resource allocation.
[Diagram label: dynamic]

21 Domain Grouping
Programmer selects domains .1.0.0 and .1.1.0 for the group.

22 Domain Grouping
Programmer selects domains .1.0.0 and .1.1.0 for the group.
Resolve the lowest common ancestor of the domains in the group.

23 Domain Grouping
Programmer selects domains .1.0.0 and .1.1.0 for the group.
Resolve the lowest common ancestor of the domains in the group.
The group is added as a subdomain of the lowest common ancestor.

24 Domain Grouping
Programmer selects domains .1.0.0 and .1.1.0 for the group.
Resolve the lowest common ancestor of the domains in the group.
The group is added as a subdomain of the lowest common ancestor.
GROUP domains can be treated as a single regular component, but they still represent the original topology of the grouped components.
Domain tags of the group are recalculated and do not collide with existing domain tags.
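The grouping steps can be sketched in C++ on the domain tags alone. This sketch assumes that a tag such as `.1.0.0` encodes the path from the root, so the lowest common ancestor is the longest common prefix of the tag components. The helper names, the `.` root tag and the rule of using the next free child index for the group are illustrative assumptions, not the dyloc API.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a domain tag such as ".1.0.0" into its components {1, 0, 0}.
std::vector<int> parse_tag(const std::string& tag) {
  std::vector<int> parts;
  std::stringstream stream(tag);
  std::string item;
  while (std::getline(stream, item, '.')) {
    if (!item.empty()) parts.push_back(std::stoi(item));
  }
  return parts;
}

// Joins components back into a tag; an empty path is written as "." (assumed root tag).
std::string format_tag(const std::vector<int>& parts) {
  if (parts.empty()) return ".";
  std::string tag;
  for (int p : parts) tag += "." + std::to_string(p);
  return tag;
}

// Lowest common ancestor of the given domains: longest common prefix of their tags.
std::string lowest_common_ancestor(const std::vector<std::string>& tags) {
  assert(!tags.empty());
  std::vector<int> prefix = parse_tag(tags[0]);
  for (std::size_t i = 1; i < tags.size(); ++i) {
    std::vector<int> current = parse_tag(tags[i]);
    std::size_t n = 0;
    while (n < prefix.size() && n < current.size() && prefix[n] == current[n]) ++n;
    prefix.resize(n);
  }
  return format_tag(prefix);
}

// Tag for a new group below `ancestor`, assuming child indices 0..existing_children-1
// are taken, so the next free index cannot collide with an existing tag.
std::string group_tag(const std::string& ancestor, int existing_children) {
  assert(existing_children >= 0);
  std::vector<int> parts = parse_tag(ancestor);
  parts.push_back(existing_children);
  return format_tag(parts);
}
```

For the domains on the slide, `lowest_common_ancestor({".1.0.0", ".1.1.0"})` yields `.1`; if `.1` already has the two children `.1.0` and `.1.1`, `group_tag(".1", 2)` yields `.1.2` for the new group.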

25 Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get
    dash::Array<int> arr(size, dash::BLOCKED);
    // [ … init values … ]
    auto min = dash::min_element(arr.begin(), arr.end());
    if (dash::myid() == 0) {
      cout << "Minimum: " << *min << endl;
    }

27 Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get – but already revealing
    min_element(first, last)
        glob_mins = Array(nunits)
        local_range = local(first, last)
        local_min = min(local_range)
        glob_mins[myid] = local_min
        barrier() // wait for all units
        global_min = min(glob_mins)
        return global_min

28 Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get – but already revealing
    min_element(first, last)
        glob_mins = Array(nunits)
        local_range = local(first, last)
        local_min = min(local_range)
        glob_mins[myid] = local_min
        barrier() // wait for all units
        global_min = min(glob_mins)
        return global_min
capacities: 64 GB, 32 cores; capabilities: 2.8 GHz, 2 SMT threads
capacities: 8 GB, 60 cores; capabilities: 1.1 GHz, 4 SMT threads

29 Use Case: Minimum Element
Find global minimum in a sequence
Basic unbalanced variant
    min_element(first, last)
        glob_mins = Array(nunits)
        local_range = local(first, last)
        local_min = min(local_range)
        glob_mins[myid] = local_min
        barrier() // wait for all units
        global_min = min(glob_mins)
        return global_min
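The basic variant can be mimicked in plain C++ by simulating the units inside a single process. The sketch below does not use DASH; the function name and the sequential loop standing in for units and the barrier are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates `nunits` units, each owning one equally sized block of `data`.
// Every unit computes its local minimum; the global minimum is then taken
// over the collected local minima, as in the pseudocode.
int blocked_min_element(const std::vector<int>& data, std::size_t nunits) {
  assert(!data.empty() && nunits > 0);
  std::vector<int> glob_mins;
  std::size_t block = (data.size() + nunits - 1) / nunits;  // ceil(size / nunits)
  for (std::size_t unit = 0; unit < nunits; ++unit) {
    std::size_t begin = std::min(unit * block, data.size());
    std::size_t end = std::min(begin + block, data.size());
    if (begin == end) continue;  // this unit owns no elements
    auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
    auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
    glob_mins.push_back(*std::min_element(first, last));
  }
  // In a real distributed run a barrier would precede this final reduction.
  return *std::min_element(glob_mins.begin(), glob_mins.end());
}
```

Because every unit receives the same block size regardless of how fast it is, the slowest unit determines when the final reduction can start, which is what makes this variant unbalanced on heterogeneous hardware.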

30 Use Case: Minimum Element
Topology-aware procedure
    // data distribution
    team_topo = dyloc::TeamTopology()
    // partition array based on capacities
    // and capabilities in locality graph:
    pattern = dash::LoadBalancedPattern(team_topo, asize, tdesc)
    array = dash::Array(pattern, team)
    // … initialize array values …
    // call algorithm
    glob_min = min_element(array.begin(), array.end())
    // algorithm
    min_element(first, last)
        glob_mins = Array(nunits)
        local_range = local(first, last)
        // threads available for this unit:
        nt = dyloc::UnitLocality().nthreads()
        Value tmins[nt] = { INFTY … }
        #pragma omp parallel for num_threads(nt)
        for (lval in local_range) {
            // keep the smallest value seen by this thread
            if (lval < tmins[tid]) tmins[tid] = lval
        }
        local_min = min(tmins)
        glob_mins[myid] = local_min
        barrier() // wait for all units
        global_min = min(glob_mins)
        return global_min
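A capacity-aware partitioning can be sketched in C++ as follows. The weighting by cores times clock frequency is an assumption chosen for illustration and is not necessarily how `dash::LoadBalancedPattern` weighs units; `UnitCaps` and `balanced_block_sizes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical per-unit capabilities, as they might be read from a locality graph.
struct UnitCaps {
  int cores;
  double ghz;
};

// Splits n elements into one block per unit, sized proportionally to the
// assumed weight cores * ghz. The last unit receives the rounding remainder.
std::vector<std::size_t> balanced_block_sizes(std::size_t n,
                                              const std::vector<UnitCaps>& units) {
  assert(!units.empty());
  std::vector<double> weights;
  for (const UnitCaps& u : units) weights.push_back(u.cores * u.ghz);
  double total = std::accumulate(weights.begin(), weights.end(), 0.0);
  assert(total > 0.0);
  std::vector<std::size_t> sizes;
  std::size_t assigned = 0;
  for (std::size_t i = 0; i + 1 < units.size(); ++i) {
    std::size_t share =
        static_cast<std::size_t>(static_cast<double>(n) * (weights[i] / total));
    sizes.push_back(share);
    assigned += share;
  }
  sizes.push_back(n - assigned);
  return sizes;
}
```

With the two example domains from the deck (32 cores at 2.8 GHz and 60 cores at 1.1 GHz), this weighting gives the first unit the larger block even though it has fewer cores.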

31 Use Case: Minimum Element
Topology-aware procedure
    // algorithm
    min_element(first, last)
        glob_mins = Array(nunits)
        local_range = local(first, last)
        // threads available for this unit:
        nt = dyloc::UnitLocality().nthreads()
        Value tmins[nt] = { INFTY … }
        #pragma omp parallel for num_threads(nt)
        for (lval in local_range) {
            // keep the smallest value seen by this thread
            if (lval < tmins[tid]) tmins[tid] = lval
        }
        local_min = min(tmins)
        glob_mins[myid] = local_min
        barrier() // wait for all units
        global_min = min(glob_mins)
        return global_min

32 Where to go next
What we can provide:
    Extending dyloc for the needs of hwloc/netloc and applications that use them
    Evaluation on exotic (“worse-than-Phi”) heterogeneous systems; benchmarks and reference applications are available in the DASH developer distribution
Where we need help:
    Porting hwloc to exotic SoCs like TI’s AM57xx (suitable for building a handy heterogeneous cluster on your desk)
    Your expertise: approaches you tried, successes, dead ends, forgotten lore …
Submit your wish list as feature requests on GitHub.
dash-project.org, github.com/dash-project, dash-project.slack.com

33 Acknowledgements
Funding
Team: T. Fuchs (LMU), R. Kowalewski (LMU), J. Schuchart (HLRS), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)
dash-project.org, github.com/dash-project, dash-project.slack.com

34 Bonus Slides

35 High Performance Megascale Computing
HPC technologies and their challenges trickled down to embedded systems years ago.
© 2016, Texas Instruments Inc. Source:

36 High Performance Megascale Computing
HPC technologies and their challenges trickled down to embedded systems years ago.
Texas Instruments ported OpenMPI to a heterogeneous SoC line in MPI, OpenMP, OpenCL used for DSP offloading and communication … between nodes, on chip, between four architectures
© 2016, Texas Instruments Inc. Source:

37 Locality Domain Alias References

