ANTON D. E. Shaw Research

Force Fields: Typical Energy Functions
- Bond stretches
- Angle bending
- Torsional rotation
- Improper torsion (sp2)
- Electrostatic interaction
- Lennard-Jones interaction
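For reference (not part of the original slide), the energy function these terms add up to is typically written in a form like the following; the exact functional forms and parameters vary by force field:

```latex
U(\mathbf{r}) =
  \sum_{\text{bonds}} k_b (b - b_0)^2
+ \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
+ \sum_{\text{torsions}} k_\phi \left[1 + \cos(n\phi - \delta)\right]
+ \sum_{\text{impropers}} k_\omega (\omega - \omega_0)^2
+ \sum_{i<j} \left[ \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}
  + 4\varepsilon_{ij}\left( \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
  - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right) \right]
```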

Molecular Dynamics
Solve Newton's equation for a molecular system:
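The equation itself did not survive the transcript; written out, it is Newton's second law applied to each atom i, with the force coming from the potential above:

```latex
m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} U(\mathbf{r}_1, \dots, \mathbf{r}_N)
```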

Integrator: Verlet Algorithm
Start with {r(t), v(t)} and integrate it to {r(t+Δt), v(t+Δt)}.
The new position at t+Δt:  r(t+Δt) = r(t) + v(t)Δt + (1/2)a(t)Δt^2   (1)
Similarly, the old position at t-Δt:  r(t-Δt) = r(t) - v(t)Δt + (1/2)a(t)Δt^2   (2)
Add (1) and (2):  r(t+Δt) = 2r(t) - r(t-Δt) + a(t)Δt^2   (3)
Thus the velocity at t is:  v(t) = [r(t+Δt) - r(t-Δt)] / (2Δt)   (4)
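A minimal sketch of this integrator in Python (illustrative only; the variable names and the made-up harmonic test force are mine, not from the slides):

```python
import numpy as np

def verlet_step(r, r_prev, forces, masses, dt):
    """One step of the position-Verlet scheme on the slide:
    r(t+dt) = 2 r(t) - r(t-dt) + a(t) dt^2,  v(t) = [r(t+dt) - r(t-dt)] / (2 dt)."""
    a = forces / masses[:, None]            # a(t) = F(t) / m, per atom
    r_next = 2.0 * r - r_prev + a * dt**2   # equation (3)
    v = (r_next - r_prev) / (2.0 * dt)      # equation (4)
    return r_next, v

# Tiny usage example with a hypothetical harmonic restoring force F = -k r
k, dt = 1.0, 0.001
masses = np.ones(2)
r_prev = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
r = r_prev + dt * np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]])  # seed from an initial velocity
for _ in range(10):
    r_next, v = verlet_step(r, r_prev, -k * r, masses, dt)
    r_prev, r = r, r_next
```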

Molecular Dynamics
Integrate Newton's laws of motion. Iterate... and iterate.

Two Distinct Problems
- Problem 1: Simulate many short trajectories
- Problem 2: Simulate one long trajectory

Simulating Many Short Trajectories
- Can answer a surprising number of interesting questions
- Can be done using:
  - Many slow computers
  - A distributed processing approach
  - Little inter-processor communication
- E.g., Pande's Folding at Home project

Simulating One Long Trajectory
- Harder problem
- Essential to elucidate many biologically interesting processes
- Requires a single machine with:
  - Extremely high performance
  - Truly massive parallelism
  - Lots of inter-processor communication

DESRES Goal
- Single, millisecond-scale MD simulations (long trajectories)
  - Protein with 64K or more atoms
  - Explicit water molecules
- Why? That's the time scale at which many biologically interesting things start to happen

Protein Folding (Image: Istvan Kolossvary & Annabel Todd, D. E. Shaw Research)

What Will It Take to Simulate a Millisecond?
- We need an enormous increase in speed
  - Current (single processor): ~100 ms / fs
  - Goal will require < 10 µs / fs
- Required speedup:
  - > 10,000x faster than current single-processor speed
  - ~1,000x faster than current parallel implementations
- Can't accept >10,000x the power (~5 megawatts)!

What Takes So Long?
- Inner loop of force field evaluation looks at all pairs of atoms (within distance R)
- On the order of 64K atoms in a typical system
- Repeat ~10^12 times
- Current approaches too slow by several orders of magnitude
- What can be done?
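A rough consistency check of these numbers (my own arithmetic, assuming a ~1 fs timestep, which is what the "100 ms / fs" figure on the previous slide implies):

```latex
\frac{1\,\mathrm{ms}}{1\,\mathrm{fs/step}} = 10^{12}\ \text{steps};
\qquad
10^{12} \times 100\,\mathrm{ms} = 10^{11}\,\mathrm{s} \approx 3000\ \text{years (single processor)};
\qquad
10^{12} \times 10\,\mu\mathrm{s} = 10^{7}\,\mathrm{s} \approx 4\ \text{months (at the target rate)}
```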

MD Simulator Requirements
Parallelization (getting an idea of the level of computation needed):
1) For every time step, every atom must interact with every other atom within its cut-off radius.
2) A lot of inter-processor communication that scales well is needed.

MD Simulator Requirements
Parallelization (getting an idea of the level of computation needed):
- The whole system is broken down into boxes (processing nodes)
- Each node handles the bonded interactions within its box
- NT method for non-bonded interactions (much more common)
- NT method for atom migration
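A minimal sketch of the box decomposition described here (illustrative Python under my own assumptions; box size, atom count, and the 8 x 8 x 8 mesh are just example numbers drawn from elsewhere in the talk):

```python
import numpy as np

def assign_atoms_to_boxes(positions, box_lengths, mesh=(8, 8, 8)):
    """Map each atom to the (i, j, k) index of its home box in a uniform
    3D spatial decomposition with periodic boundary conditions."""
    box_lengths = np.asarray(box_lengths, dtype=float)
    mesh = np.asarray(mesh)
    cell = box_lengths / mesh                      # edge lengths of one box
    wrapped = np.mod(positions, box_lengths)       # wrap into the primary periodic cell
    return np.floor(wrapped / cell).astype(int) % mesh  # one (i, j, k) per atom

# Example: 25,000 atoms in a 64 x 64 x 64 Angstrom box over an 8 x 8 x 8 node mesh
pos = np.random.uniform(0.0, 64.0, size=(25000, 3))
home_boxes = assign_atoms_to_boxes(pos, (64.0, 64.0, 64.0))
```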

Why Specialized Hardware? Computation, Communication, and Memory Needs
1) Need a huge number of arithmetic processing elements
2) A lot of inter-processor communication that scales well is needed
3) Memory is not an issue
   - With 25,000 atoms (64 bytes each), total = 1.6 MB; over 512 nodes = 3.2 KB/node, which is less than most L1 caches

Why Specialized Hardware? Computation, Communication, and Memory Needs
- Consider Moore's Law's ~10x improvement in 5 years vs. Anton's 1000x in 1 year. Can great discoveries wait?
- Can use custom pipelines with more precision and faster datapath logic, in less silicon area
- Tailored ISAs for geometric calculations, plus programmability for accommodating various force fields and integration algorithms
- Dedicated memory for each particle to accumulate forces

ANTON Strategy
- New architectures
  - Design a specialized machine
  - Enormously parallel architecture
  - Based on special-purpose ASICs
  - Dramatically faster for MD, but less flexible
  - Projected completion: 2008
- New algorithms
  - Applicable to conventional clusters and to our own machine
  - Scale to a very large number of processing elements

Interdisciplinary Lab
- Computational chemists and biologists
- Computer scientists and applied mathematicians
- Computer architects and engineers

Alternative Machine Architectures
- Conventional cluster of commodity processors
- General-purpose scientific supercomputer
- Special-purpose molecular dynamics machine

Conventional Cluster of Commodity Processors
- Strengths:
  - Flexibility
  - Mass-market economies of scale
- Limitations:
  - Doesn't exploit special features of the problem
  - Communication bottlenecks, both between processor and memory and among processors
  - Insufficient arithmetic power

General-Purpose Scientific Supercomputer
- E.g., IBM Blue Gene
- More demanding goal than ours: general-purpose scientific supercomputing, fast for a wide range of applications
- Strengths:
  - Flexibility
  - Ease of programmability
- Limitations for MD simulations:
  - Expensive
  - Still not fast enough for our purposes

Anton: Special-Purpose MD Machine
- Strengths:
  - Several orders of magnitude faster for MD
  - Excellent cost/performance characteristics
- Limitations:
  - Not designed for other scientific applications (they'd be difficult to program and still wouldn't be especially fast)
  - Limited flexibility

Anton System-Level Organization
- Multiple segments (probably 8 in the first machine)
- 512 nodes (each consisting of one ASIC plus DRAM) per segment
  - Organized in an 8 x 8 x 8 toroidal mesh
- Each ASIC has performance equivalent to roughly 500 general-purpose microprocessors
  - ASIC power similar to a single microprocessor

3D Torus Network

Why a 3D Torus?
- Topology reflects the physical space being simulated:
  - Three-dimensional nearest-neighbor connections
  - Periodic boundary conditions
- Bulk of communication is to near neighbors
  - No switching to reach immediate neighbors
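A small sketch of how nearest-neighbor addressing works on such a torus (a hypothetical helper of my own, not Anton's actual routing logic):

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """Return the six nearest neighbors of a node in a 3D torus:
    +/-1 along each axis, wrapping at the edges (periodic boundaries)."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# A corner node wraps around: (0, 0, 0) is directly connected to (7, 0, 0), etc.
print(torus_neighbors((0, 0, 0)))
```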

Source of Speedup on Our Machine
- Judicious use of arithmetic specialization
  - Flexibility and programmability only where needed
  - Elsewhere, hardware tailored for speed (tables and parameters, but not programmable)
- Carefully choreographed communication
  - Data flows to just where it's needed
  - Almost never need to access off-chip memory

Two Subsystems on Each ASIC
- Flexible Subsystem: programmable, general-purpose; efficient geometric operations; modest clock rate
- Specialized Subsystem: pairwise point interactions; enormously parallel; aggressive clock rate

Anton: 33M-gate ASIC
- Two computational subsystems connected by a communication ring
- Hardware datapaths compute over 25 billion interactions/s
- 13 embedded processors
- Full machine has 512 ASICs in a 3D torus

Where We Use Specialized Hardware
Specialized hardware (with tables, parameters) where:
- Inner loop
- Simple, regular algorithmic structure
- Unlikely to change
Examples: electrostatic forces, van der Waals interactions

Example: Particle Interaction Pipeline (one of 32)
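The pipeline diagram referenced here did not survive the transcript; for concreteness, the per-pair arithmetic such a pipeline evaluates looks roughly like this software sketch (cutoff Coulomb plus Lennard-Jones; the parameter names, units, and plain truncation are my own simplifications, and the real hardware is far more specialized):

```python
import numpy as np

COULOMB_CONST = 332.0636  # kcal*Angstrom/(mol*e^2), a common MD unit convention

def pair_energy_force(r_i, r_j, q_i, q_j, epsilon, sigma, cutoff=12.0):
    """Energy and force on atom i from atom j: Coulomb + Lennard-Jones,
    zeroed beyond the cutoff radius (plain truncation, for illustration only)."""
    d = r_i - r_j
    r2 = float(np.dot(d, d))
    if r2 > cutoff * cutoff:
        return 0.0, np.zeros(3)
    inv_r = 1.0 / np.sqrt(r2)
    sr6 = (sigma * inv_r) ** 6
    e_lj = 4.0 * epsilon * (sr6 * sr6 - sr6)
    e_coul = COULOMB_CONST * q_i * q_j * inv_r
    # dU/dr for both terms, then project along the j -> i direction
    dU_dr = (-4.0 * epsilon * (12.0 * sr6 * sr6 - 6.0 * sr6) * inv_r
             - COULOMB_CONST * q_i * q_j * inv_r * inv_r)
    force_on_i = -dU_dr * (d * inv_r)
    return e_lj + e_coul, force_on_i

# Example: two atoms 3 Angstroms apart
e, f = pair_energy_force(np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0]),
                         q_i=0.4, q_j=-0.4, epsilon=0.15, sigma=3.2)
```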

High-Throughput Interaction Subsystem (HTIS)
- Executes non-bonded MD interaction calculations (charge spreading & force interpolation)
- Accumulates forces on each particle as data streams through
- The ICB controls the flow of data through the HTIS, provides programmable ISA extensions, and acts as a buffering, pre-fetching, synchronization, and write-back controller

Array of 32 Particle Interaction Pipelines

Advantages of Particle Interaction Pipelines
- Save area that would have been allocated to:
  - Cache
  - Control logic
  - Wires
- Achieve extremely high arithmetic density
- Save time that would have been spent on:
  - Cache misses
  - Load/store instructions
  - Misc. data shuffling

Where We Use Flexible Hardware
Use programmable hardware where:
- The algorithm is less regular
- A smaller % of total computation (e.g., local interactions, of which there are fewer)
- More likely to change
Examples:
- Bonded interactions
- Bond length constraints
- Experimentation with new, short-range force field terms and alternative integration techniques
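As an example of the "less regular" bonded work listed above, here is one harmonic bond-stretch term in software (a sketch with my own parameter names; Anton's flexible subsystem computes such terms on its geometry cores, not with code like this):

```python
import numpy as np

def harmonic_bond(r_i, r_j, k_b, b0):
    """Energy and forces for one bond-stretch term U = k_b * (b - b0)^2,
    where b is the current bond length and b0 the equilibrium length."""
    d = r_i - r_j
    b = float(np.linalg.norm(d))
    energy = k_b * (b - b0) ** 2
    dU_db = 2.0 * k_b * (b - b0)
    f_i = -dU_db * (d / b)     # force on atom i; atom j feels the opposite force
    return energy, f_i, -f_i
```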

Forms of Parallelism in the Flexible Subsystem
The Flexible Subsystem exploits three forms of parallelism:
- Multi-core parallelism (4 Tensilica cores, 8 Geometry Cores)
- Instruction-level parallelism
- SIMD parallelism: calculate on 3D and 4D vectors as a single operation
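A tiny software illustration of the SIMD style described here (NumPy standing in for hardware vector lanes; the data is entirely my own example):

```python
import numpy as np

# Each row is one packed 3D/4D value; a single vectorized expression updates
# every lane at once, mimicking a SIMD operation on short vectors.
a = np.array([[1.0, 2.0, 3.0, 0.0],
              [4.0, 5.0, 6.0, 0.0]])
b = np.array([[0.5, 0.5, 0.5, 0.0],
              [1.0, 1.0, 1.0, 0.0]])
sums = a + b                         # lane-wise add of two 4-vectors
dots = np.einsum('ij,ij->i', a, b)   # per-vector dot product in one operation
```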

Overview of the Flexible Subsystem (GC = Geometry Core, each a VLIW processor)

Geometry Core (one of 8; 64 pipelined lanes/chip)

But Communication Is Still a Bottleneck
- Scalability limited by inter-chip communication
- To execute a single millisecond-scale simulation:
  - Need a huge number of processing elements
  - Must dramatically reduce the amount of data transferred between these processing elements
- Can't do this without fundamentally new algorithms:
  - A family of Neutral Territory (NT) methods that significantly reduce the communication load of pair interactions
  - A new variant of the Ewald distant method, Gaussian Split Ewald (GSE), which simplifies calculation and communication for distant interactions
  - These are the subject of a different talk

Anton in Action

Simulation Evaluations
- 500x vs. NAMD
- 80-100x vs. Desmond
- 100x vs. Blue Matter

GPU + FPGA???
- GPU: 6x GDDR5, 16x PCIe
- FPGA: high-speed serial I/O up to 2 Tbit/s, LVDS, FFT and LJ