Published by Malcolm Murphy. Modified over 7 years ago.
1
Runtime Support for Distributed Dynamic Locality
COLOC / Euro-Par 2017
Tobias Fuchs, Karl Fürlinger
Ludwig-Maximilians-Universität (LMU) München
DASH Project
2
Limitations of the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees.
Runtime Support for Dynamic Hardware Locality
3
Limitations of the Tree Model
Reference: Brice Goglin, "Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications" (2017)
Approaches to modelling Xeon Phi KNL in the hwloc topology tree; an excellent wrap-up of KNL memory modes.
4
Limitations of the Tree Model
An application that assumes a tree topology would discover cores or RAM twice when recursing through the topology structure.
6
Limitations of the Tree Model
Detail on affinity and distance is lost in the conversion to a tree structure; the application cannot specify which structural aspects it needs to be maintained.
7
Limitations of the Tree Model
“This tree representation cannot be a perfect match for non-hierarchical NUMA interconnects but the application may still query the latency matrix to get the exact topology information if needed.”
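The loss of detail can be made concrete with a distance matrix. If a tree is read as "distance depends only on the level of the lowest common ancestor", the distances it induces satisfy d(x,z) <= max(d(x,y), d(y,z)) for every triple of distinct nodes. The C++ sketch below checks that property. The criterion, the function names and both example matrices (SLIT-style values for a two-socket machine and for a four-node ring) are assumptions made for illustration; none of them are taken from hwloc or dyloc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// True if d(x,z) <= max(d(x,y), d(y,z)) holds for all distinct x, y, z,
// i.e. if the distances could stem from a strict hierarchy of levels.
bool is_ultrametric(const Matrix& d) {
    const std::size_t n = d.size();
    for (const auto& row : d) assert(row.size() == n);  // matrix must be square
    for (std::size_t x = 0; x < n; ++x)
        for (std::size_t y = 0; y < n; ++y)
            for (std::size_t z = 0; z < n; ++z) {
                if (x == y || y == z || x == z) continue;
                if (d[x][z] > std::max(d[x][y], d[y][z])) return false;
            }
    return true;
}

// Hypothetical values: two sockets with two NUMA nodes each.
Matrix two_socket_tree() {
    return {{10, 16, 22, 22},
            {16, 10, 22, 22},
            {22, 22, 10, 16},
            {22, 22, 16, 10}};
}

// Hypothetical values: four NUMA nodes connected in a ring (0-1-2-3-0).
Matrix four_node_ring() {
    return {{10, 16, 22, 16},
            {16, 10, 16, 22},
            {22, 16, 10, 16},
            {16, 22, 16, 10}};
}
```

In this sketch `is_ultrametric(two_socket_tree())` holds, while the ring fails the check because node 0 reaches node 2 at distance 22 although both are at distance 16 from node 1. A tree view of the ring therefore has to drop or distort at least one of these distances.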
8
Alternatives to the Tree Model
Modern systems exhibit memory hierarchies that cannot be represented as trees … but:
Tree topologies are a commonly understood, widespread mental model
The majority of existing topology-aware software assumes an “arborescent” memory hierarchy
A complex graph with full physical topology detail is unusable as a programming abstraction; code would be specific to the graph of each platform
9
Making the Tree Model Work
Figure labels: Physical, Usable
10
Making the Tree Model Work
Figure labels: Physical, Formal
11
Making the Tree Model Work
Figure labels: Physical, Useful
12
Making the Tree Model Work
Figure labels: Physical, Useful
Feasible! But requires you to “ask the right question”.
13
Projecting Topology Graphs to Tree Views
Diagram labels: hwloc data, dyloc runtime, hwloc topology object
Phases: Application Startup Phase, Application Run
Retrieve detailed hardware information on each machine
14
Projecting Topology Graphs to Tree Views
Diagram labels: hwloc data, dyloc runtime, static physical topology graph, hwloc topology object
Phases: Application Startup Phase, Application Run
Retrieve detailed hardware information on each machine
Combine into global physical topology graph
15
Projecting Topology Graphs to Tree Views
Diagram labels: hwloc data, dyloc runtime, static physical topology graph, tree topology view, hwloc topology object
Phases: Application Startup Phase, Application Run
Retrieve detailed hardware information on each machine
Combine into global physical topology graph
Create light-weight tree views on the topology graph at run time
Topology view materialized as hwloc topology object
17
Projecting Topology Graphs to Tree Views
Separate concerns:
hwloc: gather the most detailed system information possible (physical topology, static)
dyloc: find a concise representation of the physical topology under given constraints (logical topology, dynamic views)
18
Locality Domain Attributes
20
Locality Domain Attributes
Dynamic capacities depend on the placement of units and their resource allocation.
Figure label: dynamic
21
Domain Grouping
Programmer selects domains .1.0.0 and .1.1.0 for the group
22
Domain Grouping
Resolve the lowest common ancestor of the domains in the group
23
Domain Grouping
Group is added as a subdomain of the lowest common ancestor
24
Domain Grouping
GROUP domains can be treated as a single regular component, but they still represent the original topology of the grouped components.
Domain tags of the group are recalculated and do not collide with existing domain tags.
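Resolving the lowest common ancestor can be sketched directly on the domain tags, assuming that a tag such as .1.0.0 lists the child indices on the path from the root. Under that assumption the lowest common ancestor is the longest common prefix of tag components. The function names and the convention of writing the root as "." are hypothetical and are not dyloc's API.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a domain tag such as ".1.0.0" into its components {"1", "0", "0"}.
std::vector<std::string> split_tag(const std::string& tag) {
    std::vector<std::string> parts;
    std::stringstream ss(tag);
    std::string part;
    while (std::getline(ss, part, '.')) {
        if (!part.empty()) parts.push_back(part);
    }
    return parts;
}

// Lowest common ancestor = longest common prefix of tag components.
// The root domain is written "." here (an assumption of this sketch).
std::string lowest_common_ancestor(const std::vector<std::string>& tags) {
    assert(!tags.empty());
    std::vector<std::string> prefix = split_tag(tags.front());
    for (std::size_t i = 1; i < tags.size(); ++i) {
        const std::vector<std::string> parts = split_tag(tags[i]);
        std::size_t n = 0;
        while (n < prefix.size() && n < parts.size() && prefix[n] == parts[n]) ++n;
        prefix.resize(n);  // keep only the components shared so far
    }
    if (prefix.empty()) return ".";
    std::string result;
    for (const auto& p : prefix) result += "." + p;
    return result;
}
```

For the two domains selected above, `lowest_common_ancestor({".1.0.0", ".1.1.0"})` yields `.1`, which is where the group would be attached. Comparing whole components rather than characters keeps `.1.0` and `.10.0` apart.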
25
Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get

    dash::Array<int> arr(size, dash::BLOCKED);
    // [ … init values … ]
    auto min = dash::min_element(arr.begin(), arr.end());
    if (dash::myid() == 0) {
      cout << "Minimum: " << *min << endl;
    }
27
Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get, but already revealing

    min_element(first, last)
      glob_mins   = Array(nunits)
      local_range = local(first, last)
      local_min   = min(local_range)
      glob_mins[myid] = local_min
      barrier() // wait for all units
      global_min = min(glob_mins)
      return global_min
28
Use Case: Minimum Element
Find global minimum in a sequence
As simple as it can get, but already revealing
Locality domain annotations shown in the figure:
    capacities: 64 GB, 32 cores; capabilities: 2.8 GHz, 2 SMT threads
    capacities: 8 GB, 60 cores; capabilities: 1.1 GHz, 4 SMT threads
29
Use Case: Minimum Element
Find global minimum in a sequence
Basic unbalanced variant

    min_element(first, last)
      glob_mins   = Array(nunits)
      local_range = local(first, last)
      local_min   = min(local_range)
      glob_mins[myid] = local_min
      barrier() // wait for all units
      global_min = min(glob_mins)
      return global_min
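The two phases of the unbalanced variant (local minimum per unit, then minimum over the per-unit results) can be tried without a distributed runtime. The following C++ sketch simulates the units sequentially over one std::vector; the function name `blocked_min` and the equal-sized block split are assumptions standing in for dash::BLOCKED, and no DASH calls are used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential stand-in for the pseudocode: every "unit" owns one contiguous
// block of equal size, regardless of how fast that unit is.
int blocked_min(const std::vector<int>& values, std::size_t nunits) {
    assert(!values.empty() && nunits > 0);
    const std::size_t n = values.size();
    const std::size_t block = (n + nunits - 1) / nunits;  // ceil(n / nunits)
    std::vector<int> glob_mins;  // one entry per unit that owns data
    for (std::size_t unit = 0; unit < nunits; ++unit) {
        const std::size_t begin = std::min(unit * block, n);
        const std::size_t end = std::min(begin + block, n);
        if (begin == end) continue;  // this unit owns no elements
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
        glob_mins.push_back(*std::min_element(first, last));  // local_min
    }
    // In the distributed version a barrier separates the two phases.
    return *std::min_element(glob_mins.begin(), glob_mins.end());
}
```

Because every unit receives the same number of elements, the barrier is reached only when the slowest unit has finished, which is the imbalance the topology-aware variant addresses.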
30
Use Case: Minimum Element
Topology-aware procedure

    // data distribution
    team_topo = dyloc::TeamTopology()
    // partition array based on capacities
    // and capabilities in locality graph:
    pattern = dash::LoadBalancedPattern(team_topo, asize, tdesc)
    array   = dash::Array(pattern, team)
    // … initialize array values …
    // call algorithm
    glob_min = min_element(array.begin(), array.end())
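How a load-balanced pattern might size the blocks can be sketched as follows. This is not the implementation of dash::LoadBalancedPattern: the weight "cores times clock frequency", the struct `UnitCapability` and the function `balanced_block_sizes` are hypothetical and chosen only to show blocks that are proportional to a capability measure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-unit description; not a dyloc type.
struct UnitCapability {
    int cores;
    double ghz;
};

// Block sizes proportional to cores * GHz; the sizes always sum to n.
std::vector<std::size_t> balanced_block_sizes(
        std::size_t n, const std::vector<UnitCapability>& units) {
    assert(!units.empty());
    double total = 0.0;
    for (const auto& u : units) total += u.cores * u.ghz;
    assert(total > 0.0);
    std::vector<std::size_t> sizes;
    double cumulative = 0.0;
    std::size_t assigned = 0;
    for (const auto& u : units) {
        cumulative += u.cores * u.ghz;
        // End of this unit's block in the global index range, rounded down.
        std::size_t boundary = static_cast<std::size_t>(n * (cumulative / total));
        if (boundary > n) boundary = n;
        if (boundary < assigned) boundary = assigned;
        sizes.push_back(boundary - assigned);
        assigned = boundary;
    }
    sizes.back() += n - assigned;  // last unit absorbs any rounding remainder
    return sizes;
}
```

With the two hardware descriptions from the earlier slide (32 cores at 2.8 GHz, 60 cores at 1.1 GHz) and 1000 elements, this sketch assigns 575 elements to the first unit and 425 to the second.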
31
Use Case: Minimum Element
Topology-aware procedure

    // algorithm
    min_element(first, last)
      glob_mins   = Array(nunits)
      local_range = local(first, last)
      // threads available for this unit:
      nt = dyloc::UnitLocality().nthreads()
      Value tmins[nt] = { INFTY … }
      #pragma omp parallel for num_threads(nt)
      for (lval in local_range) {
        // tid: index of the executing thread
        if (lval < tmins[tid]) tmins[tid] = lval
      }
      local_min = min(tmins)
      glob_mins[myid] = local_min
      barrier() // wait for all units
      global_min = min(glob_mins)
      return global_min
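The thread-parallel local step can be written compactly in C++ with OpenMP. Note the swap: instead of the explicit per-thread array from the pseudocode, this sketch uses OpenMP's `reduction(min : …)` clause, which keeps one private minimum per thread and combines them at the end. The function name `threaded_local_min` is hypothetical, and `nt` is passed in as a plain argument where the slide obtains it from dyloc.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Minimum of a unit's local range, computed by up to nt threads.
// Without OpenMP support the pragma is ignored and the loop runs sequentially.
int threaded_local_min(const std::vector<int>& local_range, int nt) {
    assert(!local_range.empty() && nt > 0);
    int local_min = std::numeric_limits<int>::max();
    const long n = static_cast<long>(local_range.size());
    #pragma omp parallel for num_threads(nt) reduction(min : local_min)
    for (long i = 0; i < n; ++i) {
        const int lval = local_range[static_cast<std::size_t>(i)];
        if (lval < local_min) local_min = lval;  // update the thread's private minimum
    }
    return local_min;
}
```

The result is the value that the pseudocode stores in `glob_mins[myid]` before the barrier. Compiling with OpenMP enabled (for example `-fopenmp` with GCC or Clang) is needed for the loop to actually run in parallel.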
32
Where to go next
What we can provide:
    Extending dyloc for the needs of hwloc/netloc and applications that use them
    Evaluation on exotic (“worse-than-Phi”) heterogeneous systems; benchmarks and reference applications are available in the DASH developer distribution
Where we need help:
    Porting hwloc to exotic SoCs like TI’s AM57xx (suitable for building a handy heterogeneous cluster on your desk)
    Your expertise: approaches you tried, successes, dead ends, forgotten lore …
Submit your wish list as feature requests on GitHub
dash-project.org
github.com/dash-project
dash-project.slack.com
33
Acknowledgements Funding Team
T. Fuchs (LMU), R. Kowalewski (LMU), J. Schuchart (HLRS), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)
dash-project.org
github.com/dash-project
dash-project.slack.com
34
Bonus Slides
35
High Performance Megascale Computing
HPC technologies and their challenges trickled down to embedded systems years ago.
© 2016, Texas Instruments Inc.
36
High Performance Megascale Computing
HPC technologies and their challenges trickled down to embedded systems years ago.
Texas Instruments ported OpenMPI to a heterogeneous SoC line
MPI, OpenMP, OpenCL used for DSP offloading and communication …
    between nodes
    on chip
    between four architectures
© 2016, Texas Instruments Inc.
37
Locality Domain Alias References