
1 Exploiting Dataflow Parallelism in Teradevice Computing
An Overview of a TERAFLUX-like Architecture (a Proposal to Harness the Future Multicores)
University of Cyprus, University of Manchester, University of Siena, University of Augsburg, INRIA, Barcelona Supercomputing Center – TERAFLUX.EU
Roberto Giorgi – University of Siena (coordinator)
CASTNESS'11 Workshop (Computing Architectures, Software tools and nano-technologies for Numerical and Embedded Scalable Systems), Rome, 18/01/2011

2 Technologies for the coming years
Many new technologies on the horizon: graphene, junctionless transistors… paving the way for 1 TERA devices in a chip/package within a few years
Feasibility also explored in EU FET projects like TRAMS
2010-09-13

3 Which Multicore Architecture for 2020?
The classical 1000-Billion Euro Question!
Lessons from the past:
– Message-passing based architectures have poor PROGRAMMABILITY
– Shared-memory based architectures have limited scalability or are quite COMPLEX TO DESIGN
– Failure in part of the computation compromises the whole computation: poor RELIABILITY

4 CMP of the future
1000 Billion- or 1 TERA-device computing platforms pose new challenges:
– (at least) programmability, complexity of design, reliability
TERAFLUX context:
– High-performance computing and applications (not necessarily embedded)
TERAFLUX scope:
– Exploiting a less exploited path (DATAFLOW) at each level of abstraction
[Figure: 3D-stacked CMP, from G. Hendry, K. Bergman, "Hybrid On-chip Data Networks", HotChips-22, Stanford, CA, Aug. 2010]

5 What we propose
Exploiting dataflow concepts both
– at task level and
– inside the threads
Offload and manage accelerated codes
– to localize the computation
– to respect the power/performance/temperature/reliability envelope
– to efficiently handle the parallelism and have an easy and powerful execution model
⇒ PUSHING THE DATA WHERE IT IS NEEDED
Some techniques proposed:
– Giorgi, R., Popovic, Z., Puzovic, N., "DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems", 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007), pp. 263-270, 24-27 Oct. 2007. DOI: http://dx.doi.org/10.1109/SBAC-PAD.2007.27
– M. Aater Suleman, Onur Mutlu, Jose A. Joao, Khubaib, and Yale N. Patt, "Data Marshaling for Multi-core Architectures", Proceedings of the 37th International Symposium on Computer Architecture (ISCA), Saint-Malo, France, June 2010

6 Our pillars
– FIXED and MOST-USED ISA (x86)
– MANYCORE FULL-SYSTEM SIMULATOR (COTSon)
– REAL-WORLD APPLICATIONS (e.g. GROMACS)
– SYNCHRONIZATION: TRANSACTIONAL MEMORY
– GCC-based TOOL-CHAIN
– OFF-THE-SHELF COMPONENTS FOR CORES, OS, NoC
– FDU and TSU (Fault Detection Unit and Thread Scheduling Unit)

7 Recent Reference Numbers
26/05/2010 – CEA/BULL – TERA100 is the first European supercomputer to reach 1 Petaflops (top500 #6 as of Nov. 2010)
– Cluster of 4370 nodes with 4 x64 CPUs per node (17480 CPUs or 139840 cores); 300 TB of memory; 20 PB of secondary storage; benchmark: LINPACK; 5 MW
11/11/2010 – National Supercomputing Center in Tianjin – Tianhe-1A, the fastest supercomputer in the world (2.5 Petaflops)
– 7168 NVIDIA® Tesla™ M2050 GPUs (about 3.2 million CUDA cores) and 14336 CPUs or 86016 cores; 230 TB of memory; benchmark: LINPACK; 4 MW

8 A possible TERAFLUX architectural instance
[Figure: nodes connected by a NoC to DRAM and I/O cores]
– Each node: Core (PE, L1$, L2$-partition) plus Uncore (TSU/FDU, NoC-tap, …)
– Legend: AC = Auxiliary Core; SC = Service Core; IOx = I/O or SC Core (IO1: disk, IO2: keyboard, IO3: NIC); TSU = Thread Scheduling Unit; FDU = Fault Detection Unit

9 Abstraction Layer and Reliability Layer – Holistic approach
[Figure: layered view spanning WP2-WP7]
– APPLICATIONS and source code enter the Programming Model (data dependencies, transactional memory)
– Compilation Tools extract TLP and apply locality optimizations, producing threads (T1, T2, …)
– Threads map onto Virtual CPUs (VCPUs) over the simulated teradevice hardware: possibly 1,000-10,000 cores

10 TERAFLUX: toward a different world
Relying on existing architectures as much as possible while introducing key modifications to enhance programmability, simplicity of design, and reliability:
– not a brand new language, but leverage and extend other open efforts [C+TM, SCALA, OpenMP]
– not a brand new system, but leverage and extend other open software frameworks [GCC]
– not a brand new CPU architecture, but leverage and extend industry-standard commodities [x86]
However, the implications on "classical limitations" can be huge:
– requirements of the hardware memory architecture which limit extensibility (a.k.a. scalability) can be relaxed significantly
– turning the dataflow model into a general-purpose approach through the addition of transactions

11 Top-Level ARCHITECTURAL design: a view from 1000 feet
Pool of MANY asymmetric cores based on the x86-64 ISA on a single chip (e.g. 1000 cores or more)
– Some NoC, some memory hierarchy, some I/O, some physical layout (e.g. 3D multi-chip), off-the-shelf LINUX → not in the scope of TERAFLUX; some options will however be proposed/explored
TERAFLUX baseline machine: the simplest thing we have now
– E.g., 64 nodes of 16 cores each, L1, L2, hierarchical interconnections
– Need to evolve this architecture WITHOUT binding the software to it, to let the architecture fully explore dataflow concepts at machine level
Major "cross-challenge": how to integrate the contribution of each WP so that the work is done toward a higher goal we could NOT reach as separate WPs

12 TERAFLUX key results we are aiming at & long-term impact
Coarse-grain dataflow model (or fine-grain multithreaded model)
– fine-grain transactional isolation
– scalable to many cores and distributed memory
– with built-in application-unaware resilience
– with novel hardware support structures as needed
A solid and open evaluation platform based on an x86 simulator, COTSon, by TERAFLUX partner HP Labs (http://cotson.sourceforge.net/)
– enables leveraging the large software body out there (OS, middleware, libraries, applications)
We are available for cooperation on COTSon also with other EU projects (especially TERACOMP projects)
RESEARCH PAPERS: http://teraflux.eu/Publications
TERADEVICE SIMULATOR: http://cotson.sourceforge.net

13 Conclusions: Major Technical Innovations in TERAFLUX
Fragmenting the applications into finer-grained DF-threads:
– DF-threads provide an easy way to decouple memory accesses, thereby hiding memory latencies, balancing the load, and managing fault and temperature information without fine-grain intervention of the software
– Possibility to repeat the execution of a DF-thread in case it ran on a core later discovered to be faulty
Taking advantage of "direct" dataflow communication of the data (through what we call DF-frames)
Synchronizing threads while taking advantage of native dataflow mechanisms (e.g. several threads can be synchronized at a barrier)
– DF-threads allow (atomic) transactional semantics (DF meets TM)
A Thread Scheduling Unit (TSU) would allow fast thread switching and scheduling, besides the OS scheduler; scalable and distributed
A Fault Detection Unit (FDU) works in conjunction with the TSU

