Special-purpose computers for scientific simulations Makoto Taiji Processor Research Team RIKEN Advanced Institute for Computational Science Computational Biology Research Core RIKEN Quantitative Biology Center

Accelerators for Scientific Simulations (1): Ising Machines. Ising machines were built at Delft, Santa Barbara, and Bell Labs. A classical spin is only a few bits, so the hardware is simple. m-TIS (Taiji, Ito, Suzuki) was the first accelerator tightly coupled to its host computer. m-TIS I (1987)

Accelerators for Scientific Simulations (2): FPGA-based Systems. Examples: Splash, m-TIS II, Edinburgh, Heidelberg.

Accelerators for Scientific Simulations (3): GRAPE Systems. GRAvity PipE: accelerators for particle simulations, originally proposed by Prof. Chikada (NAOJ). More recently, the D. E. Shaw group developed Anton, a highly optimized system for particle simulations. GRAPE-4 (1995), MDGRAPE-3 (2006)

Classical Molecular Dynamics Simulations. Force calculation based on a "force field" (bonded terms, Coulomb, van der Waals), integration of Newton's equation of motion, and evaluation of physical quantities.
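To make the per-step workload concrete, here is a minimal sketch in C of one MD step: an all-pairs Coulomb plus Lennard-Jones force evaluation followed by velocity-Verlet integration of Newton's equation of motion. The all-pairs O(N^2) loop, the single set of LJ parameters, and the unit conventions are simplifications for illustration; production codes such as GROMACS use neighbor lists, cutoffs, Ewald sums, bonded terms, and per-atom-type parameters.

```c
/* Minimal MD step sketch: all-pairs Coulomb + Lennard-Jones forces,
 * then velocity-Verlet integration.  Illustrative only. */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } vec3;

/* Accumulate Coulomb + Lennard-Jones forces over all pairs (O(N^2)). */
void compute_forces(size_t n, const vec3 *r, const double *q,
                    double eps, double sigma, vec3 *f)
{
    for (size_t i = 0; i < n; ++i) f[i] = (vec3){0.0, 0.0, 0.0};

    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            vec3 d = { r[i].x - r[j].x, r[i].y - r[j].y, r[i].z - r[j].z };
            double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
            double inv_r2 = 1.0 / r2;
            double inv_r  = sqrt(inv_r2);
            double s6 = pow(sigma * sigma * inv_r2, 3.0);   /* (sigma/r)^6 */

            /* |F|/r : Coulomb q_i q_j / r^3 plus the LJ 12-6 derivative */
            double fr = q[i] * q[j] * inv_r * inv_r2
                      + 24.0 * eps * (2.0 * s6 * s6 - s6) * inv_r2;

            f[i].x += fr * d.x;  f[i].y += fr * d.y;  f[i].z += fr * d.z;
            f[j].x -= fr * d.x;  f[j].y -= fr * d.y;  f[j].z -= fr * d.z;
        }
    }
}

/* One velocity-Verlet step; f must hold the forces at the current
 * positions on entry and holds the updated forces on exit. */
void md_step(size_t n, vec3 *r, vec3 *v, vec3 *f, const double *m,
             const double *q, double dt, double eps, double sigma)
{
    for (size_t i = 0; i < n; ++i) {              /* half kick + drift */
        v[i].x += 0.5 * dt * f[i].x / m[i];
        v[i].y += 0.5 * dt * f[i].y / m[i];
        v[i].z += 0.5 * dt * f[i].z / m[i];
        r[i].x += dt * v[i].x;
        r[i].y += dt * v[i].y;
        r[i].z += dt * v[i].z;
    }
    compute_forces(n, r, q, eps, sigma, f);       /* forces at new positions */
    for (size_t i = 0; i < n; ++i) {              /* second half kick */
        v[i].x += 0.5 * dt * f[i].x / m[i];
        v[i].y += 0.5 * dt * f[i].y / m[i];
        v[i].z += 0.5 * dt * f[i].z / m[i];
    }
}
```

Almost all of the arithmetic is in compute_forces, which is exactly the part the GRAPE-style accelerators described below take over.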

Challenges in Molecular Dynamics simulations of biomolecules (S. O. Nielsen et al., J. Phys.: Condens. Matter 15 (2004) R481). The gap from femtosecond time steps to millisecond phenomena is a factor of 10^12, which demands both strong and weak scaling.

Scaling challenges in MD: ~50,000 FLOP/particle/step. For a typical system size of N ~ 10^5 atoms this is ~5 GFLOP/step. At 5 TFLOPS effective performance, 1 msec/step, i.e. about 170 nsec of simulated time per day: rather easy. At 5 PFLOPS effective performance, 1 μsec/step, i.e. about 200 μsec per day(?): difficult, but important.
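These estimates can be reproduced from the slide's own figures with a few lines of C, assuming N = 10^5 atoms, 50,000 FLOP per atom per step, and a 2 fs time step (the step length that makes 1 msec of wall clock per step come out to roughly 170 nsec of simulated time per day):

```c
/* Back-of-envelope check of the throughput numbers quoted above. */
#include <stdio.h>

int main(void)
{
    const double flop_per_atom  = 5.0e4;
    const double n_atoms        = 1.0e5;
    const double flop_per_step  = flop_per_atom * n_atoms;   /* ~5 GFLOP */
    const double sec_per_day    = 86400.0;
    const double dt_fs          = 2.0;     /* assumed MD time step in fs */

    const double sustained[2] = { 5.0e12, 5.0e15 };   /* 5 TFLOPS, 5 PFLOPS */
    for (int i = 0; i < 2; ++i) {
        double wall_per_step  = flop_per_step / sustained[i];      /* seconds */
        double sim_ns_per_day = sec_per_day / wall_per_step * dt_fs * 1e-6;
        printf("%.0e FLOPS: %.1e s/step, %.1f ns/day simulated\n",
               sustained[i], wall_per_step, sim_ns_per_day);
    }
    return 0;
}
```

At 5 TFLOPS this gives about 1 msec/step and ~170 ns/day; at 5 PFLOPS, about 1 μsec/step and ~170-200 μsec/day, matching the figures above.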

What is GRAPE? GRAvity PipE Special-purpose accelerator for classical particle simulations ▷ Astrophysical N-body simulations ▷ Molecular Dynamics Simulations J. Makino & M. Taiji, Scientific Simulations with Special-Purpose Computers, John Wiley & Sons, 1997.

History of GRAPE computers. Eight Gordon Bell Prizes: '95, '96, '99, '00 (double), '01, '03, '06.

GRAPE as an Accelerator. An accelerator computes forces with dedicated pipelines: most of the calculation goes to GRAPE, everything else stays on the host computer, which sends particle data and receives results. Communication = O(N) << calculation = O(N^2). Easy to build, easy to use, cost effective.
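The division of labor can be written as a host-side loop like the one below. The grape_* functions are hypothetical placeholders for a board-specific driver API (they are not the actual GRAPE library calls); the point is that only O(N) data crosses the host-accelerator link each step while the board does the O(N^2) work.

```c
/* Host-side offload loop in the GRAPE style (sketch; driver API is assumed). */
#include <stddef.h>

typedef struct { double x, y, z; } vec3;

/* Hypothetical driver interface, for illustration only. */
void grape_send_particles(const vec3 *pos, const double *charge, size_t n);
void grape_compute_forces(vec3 *force, size_t n);   /* blocks until done */

/* Time integration, bonded terms, I/O, etc. stay on the host. */
void integrate_on_host(vec3 *pos, vec3 *vel, const vec3 *force,
                       const double *mass, size_t n, double dt);

void md_loop(vec3 *pos, vec3 *vel, vec3 *force, const double *charge,
             const double *mass, size_t n, double dt, long nsteps)
{
    for (long step = 0; step < nsteps; ++step) {
        grape_send_particles(pos, charge, n);   /* O(N) data out */
        grape_compute_forces(force, n);         /* O(N^2) pairwise work on the board */
        integrate_on_host(pos, vel, force, mass, n, dt);
    }
}
```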

GRAPE in the 1990s. GRAPE-4 (1995): the first Teraflops machine. Host CPU ~0.6 Gflops, accelerator PU ~0.6 Gflops. Host: a single processor or an SMP.

GRAPE in the 2000s. MDGRAPE-3 (2006): the first petaflops machine. Host CPU ~20 Gflops, accelerator PU ~200 Gflops. Host: a cluster. (Scientists not shown, but they exist.)

Sustained Performance of the Parallel System (MDGRAPE-3): Gordon Bell Prize 2006 Honorable Mention, Peak Performance. Amyloid-forming process of yeast Sup35 peptides; a system with 17 million atoms; cutoff simulations (Rcut = 45 Å); 0.55 sec/step; sustained performance 185 Tflops, efficiency ~45%. Around a million atoms or more are needed to fill a petascale machine.

Scaling of MD on the K computer. Strong scaling down to ~50 atoms/core, i.e. ~3M atoms per Pflops; benchmark system of 1,674,828 atoms. Since the K computer is still under development, the results shown here are tentative.

Problems in Heterogeneous Systems (GRAPE/GPUs). In a small system: good acceleration and high performance/cost. In a massively parallel system: scaling is often limited by the host-host network and the host-accelerator interface. (Figure: a typical accelerator system vs. an SoC-based system.)
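The interface bottleneck can be illustrated with a toy strong-scaling model: the force work divides over P accelerated nodes, but a fixed per-step host-accelerator transfer cost and host-host latency do not. All constants below are assumed for illustration; they are not MDGRAPE or GPU measurements.

```c
/* Toy strong-scaling model for a host + accelerator MD step. */
#include <stdio.h>

int main(void)
{
    const double t_force_total = 5.0e-3;  /* total force work at P = 1, seconds */
    const double t_interface   = 20.0e-6; /* fixed host<->accelerator cost per step */
    const double t_network     = 5.0e-6;  /* fixed host<->host cost per step (P > 1) */

    for (int p = 1; p <= 4096; p *= 4) {
        double t_step = t_force_total / p + t_interface
                      + (p > 1 ? t_network : 0.0);
        printf("P = %4d   step = %8.2f us   speedup = %6.1f\n",
               p, t_step * 1e6, (t_force_total + t_interface) / t_step);
    }
    return 0;
}
```

However large P becomes, the step time never drops below the fixed interface and network cost, which is why the designs below move the network and general-purpose cores onto the same chip as the pipelines.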

Anton (D. E. Shaw Research): special-purpose pipelines + general-purpose CPU cores + a specialized network. Anton showed the importance of optimizing the communication system. R. O. Dror et al., Proc. Supercomputing 2009 (distributed on USB memory).

MDGRAPE-4: a special-purpose computer for MD simulation and a test platform for special-purpose machines. Target performance: ▷ 10 μsec/step for a 100K-atom system (at this moment it seems difficult to achieve…), which at 1 fsec/step corresponds to ▷ 8.6 μsec of simulated time per day (86,400 s/day ÷ 10 μs/step ≈ 8.6 × 10^9 steps/day). Target application: GROMACS. Completion: ~2013. Enhancements over MDGRAPE-3: ▷ 130 nm → 40 nm process, ▷ integration of the network and CPUs on the chip. The design of the SoC is not finished yet, so we cannot report a performance estimate.

MDGRAPE-4 System. Each link is 12 lanes at 6 Gbps (optical between nodes, electrical within a node), i.e. 72 Gbps raw or 7.2 GB/s of payload after 8b/10b encoding. A node is a 2U box with 48 optical fibers; 64 nodes (4 × 4 × 4), i.e. 4 pedestals, hold a total of 512 MDGRAPE-4 SoCs (8 × 8 × 8).

MDGRAPE-4 Node

MDGRAPE-4 System-on-Chip: 40 nm process (Hitachi), ~230 mm². 64 force-calculation pipelines at 0.8 GHz, 64 general-purpose processors (Tensilica Xtensa), and 72 communication lanes.

SoC Block Diagram

Embedded Global Memories in the SoC: ~1.8 MB in 4 blocks of 460 KB each. Ports for each block: ▷ 128 bit × 2 for the general-purpose cores, ▷ 192 bit × 2 for the pipelines, ▷ 64 bit × 6 for the network, ▷ 256 bit × 2 for inter-block transfers. (Diagram: four 460 KB GM blocks connected to the network, 2 pipeline blocks, and 2 GP blocks.)

Pipeline Functions: nonbonded forces and potentials; Gaussian charge assignment and back-interpolation; soft-core interactions.
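As an illustration of the per-pair arithmetic such a pipeline evaluates, here is the textbook Ewald real-space (Gaussian-screened) Coulomb term plus a plain Lennard-Jones 12-6 term in C. This is the generic formulation, not a description of the actual MDGRAPE-4 pipeline; the soft-core modification and the charge-assignment/back-interpolation stages are omitted.

```c
/* Per-pair nonbonded force and energy: Ewald real-space Coulomb + LJ 12-6. */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double fx, fy, fz, energy; } pair_result;

pair_result nonbond_pair(double dx, double dy, double dz,
                         double qi, double qj,
                         double eps, double sigma, double beta)
{
    double r2 = dx * dx + dy * dy + dz * dz;
    double r  = sqrt(r2);
    double inv_r = 1.0 / r, inv_r2 = 1.0 / r2;

    /* Ewald real-space part: erfc-screened Coulomb (beta = screening parameter) */
    double erfc_term = erfc(beta * r);
    double gauss     = 2.0 * beta / sqrt(M_PI) * exp(-beta * beta * r2);
    double e_coul    = qi * qj * erfc_term * inv_r;
    double f_coul    = qi * qj * (erfc_term * inv_r + gauss) * inv_r2;  /* |F|/r */

    /* Lennard-Jones 12-6 part */
    double s6   = pow(sigma * sigma * inv_r2, 3.0);
    double e_lj = 4.0 * eps * (s6 * s6 - s6);
    double f_lj = 24.0 * eps * (2.0 * s6 * s6 - s6) * inv_r2;           /* |F|/r */

    pair_result out;
    out.fx = (f_coul + f_lj) * dx;
    out.fy = (f_coul + f_lj) * dy;
    out.fz = (f_coul + f_lj) * dz;
    out.energy = e_coul + e_lj;
    return out;
}
```

Counting the multiplies, adds, and function evaluations in a routine like this is where per-pair operation counts of a few tens of FLOP (about 50 on the next slide) come from.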

Pipeline Block Diagram ~28 stages

Pipeline speed (tentative performance). ▷ In the 8 × 8 × 8 case: 64 pipelines × 0.8 GHz = 51.2 G interactions/sec per chip, and 512 chips = 26 T interactions/sec. With Lcell = Rc = 12 Å and half-shell atoms, 10^5 atoms × 2400 interactions / 26 T interactions/sec ≈ 9.2 μsec per step. ▷ Flops count: ~50 operations per pipeline per interaction, i.e. 2.56 Tflops/chip.
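The same estimate can be checked in a few lines, using only the numbers quoted above:

```c
/* Reproduce the tentative MDGRAPE-4 throughput estimate. */
#include <stdio.h>

int main(void)
{
    const double pipes_per_chip = 64, clock_ghz = 0.8, chips = 512;
    const double per_chip = pipes_per_chip * clock_ghz;       /* 51.2 G pairs/s */
    const double system   = per_chip * 1e9 * chips;           /* ~2.6e13 pairs/s */

    const double atoms = 1e5, pairs_per_atom = 2400;          /* half-shell, Rc = 12 A */
    const double step_time = atoms * pairs_per_atom / system; /* seconds per step */

    const double flop_per_pair = 50.0;
    const double chip_flops = per_chip * 1e9 * flop_per_pair; /* flops per chip */

    printf("%.1f G pairs/s per chip, %.2e pairs/s system\n", per_chip, system);
    printf("%.1f us per step, %.2f Tflops per chip\n",
           step_time * 1e6, chip_flops / 1e12);
    return 0;
}
```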

General-Purpose Core: a Tensilica core at 0.6 GHz, 32-bit integer / 32-bit floating point, 4 KB I-cache / 4 KB D-cache, 8 KB local data memory (▷ DMA or PIF access), 8 KB local instruction memory (▷ DMA read from a 512 KB instruction memory). (Diagram: per-core caches and local RAMs, DMA controllers, queue interfaces, PIF, a barrier unit, the control processor, global memory, and instruction memory.)
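A core with a small local memory and a DMA engine typically hides data movement by double buffering: while one tile of particle data is being processed, the DMA engine streams the next one in. The sketch below shows the pattern; dma_start/dma_wait, the channel numbering, and the particle layout are hypothetical placeholders, not the Tensilica or MDGRAPE-4 DMA API.

```c
/* Double-buffered streaming from global memory into an 8 KB local data memory. */
#include <stddef.h>

#define TILE 256   /* 256 particles * 16 B = 4 KB, so two buffers fill 8 KB */

typedef struct { float x, y, z, q; } particle;

/* Hypothetical asynchronous DMA between global and local memory. */
void dma_start(void *local_dst, const void *global_src, size_t bytes, int chan);
void dma_wait(int chan);

void process_tile(const particle *p, size_t n);   /* compute on local data */

void stream_particles(const particle *global, size_t n_total)
{
    static particle buf[2][TILE];                 /* two local-memory buffers */
    size_t tiles = (n_total + TILE - 1) / TILE;

    size_t first = n_total < TILE ? n_total : TILE;
    dma_start(buf[0], global, first * sizeof(particle), 0);     /* prefetch tile 0 */

    for (size_t t = 0; t < tiles; ++t) {
        int cur = (int)(t & 1);
        size_t n_cur = (t + 1 < tiles) ? TILE : n_total - t * TILE;

        if (t + 1 < tiles) {                      /* start fetching the next tile */
            size_t n_next = (t + 2 < tiles) ? TILE : n_total - (t + 1) * TILE;
            dma_start(buf[!cur], global + (t + 1) * TILE,
                      n_next * sizeof(particle), !cur);
        }

        dma_wait(cur);                            /* current tile has arrived */
        process_tile(buf[cur], n_cur);            /* compute while the next tile streams */
    }
}
```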

Synchronization: an 8-core synchronization unit per GP block, plus Tensilica queue-based synchronization to send messages: ▷ pipeline → control GP, ▷ network IF → control GP, ▷ GP block → control GP. Synchronization at the memory: accumulation is performed at the memory.

Network Latency. Current latency is ~400 ns from kicking the network interface, through reading global memory, to finishing the transaction of a 512 B packet into the nearest chip's global memory. Latency in the optical modules is not included. Larger than we anticipated…

Power Dissipation (tentative). Dynamic power (worst case) < 40 W: ▷ pipelines ~50%, ▷ general-purpose cores ~17%, ▷ others ~33%. Static (leakage) power: ▷ typical ~5 W, worst ~30 W. The result is ~50 Gflops/W (roughly 2.56 Tflops per chip at ~50 W), thanks to ▷ low precision, ▷ highly parallel operation at modest clock speed, and ▷ the small energy spent on data movement inside the pipelines.

Cost. Development (including the first system): ▷ ~9 M$. Reproduction: ▷ 2.5~3 M$ (not a sales price): 0.6~1 M$ for SoCs, ~1 M$ for optical components, 0.8~1 M$ for other parts, ~0.1 M$ for the host computers.

Schedule. First signoff: ~July 2012. Chip delivery: ~2H 2012. System: ~1H 2013. SDK: ~2H 2012. Software: ~1H 2014. Collaborators welcome!

Reflections (though the design is not finished yet…). Latency in the memory subsystem: ▷ distribute memory further inside the SoC. Latency in the network: ▷ a more intelligent network controller. Pipeline / general-purpose balance: ▷ shift toward general-purpose? ▷ the number of control GP cores.

What happens in the future?

Merits of specialization (repeated). Low frequency and high parallelism; low cost for data movement (in dedicated pipelines, for example). The dark silicon problem: ▷ not all transistors can be operated simultaneously because of power limits; ▷ advantages of the dedicated approach: high power efficiency and little damage to general-purpose performance.

Future Perspectives (1). In the life sciences there is ▷ high computing demand, but ▷ we do not have many applications that scale to exaflops (even petascale is difficult). Strong scaling is required for ▷ Molecular Dynamics and ▷ Quantum Chemistry (QM/MM); ▷ problems remain to be solved even at petascale.

Future Perspectives (2): Molecular Dynamics. Single-chip systems: ▷ more than 1/10 of the MDGRAPE-4 system could be integrated on a single chip in an 11 nm process; ▷ for typical simulation systems this will be the most convenient form; ▷ a network is still necessary inside the SoC. For further strong scaling: ▷ the number of operations per step for a 20K-atom system is ~10^9, while ▷ the number of arithmetic units in the system is ~10^6 per Pflops, so each unit performs only ~10^3 operations per step. Exascale therefore means a "flash" (one-pass) calculation: ▷ more specialization is required.

Acknowledgements. RIKEN: Mr. Itta Ohmura, Dr. Gentaro Morimoto, Dr. Yousuke Ohno. Hitachi Co. Ltd.: Mr. Iwao Yamazaki, Mr. Tetsuya Fukuoka, Mr. Makio Uchida, Mr. Toru Kobayashi, and many other staff members. Hitachi JTE Co. Ltd.: Mr. Satoru Inazawa, Mr. Takeshi Ohminato. Japan IBM Service: Mr. Ken Namura, Mr. Mitsuru Sugimoto, Mr. Masaya Mori, Mr. Tsubasa Saitoh.