NSF CI U Kentucky, February 2010 Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale…

NSF CI days @ U Kentucky, February 2010
Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale…
Thomas E. Cheatham III, Associate Professor, tec3@utah.edu
Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry, College of Pharmacy, University of Utah
- NSF TeraGrid Science Advisory Board
- NSF LRAC/MRAC allocations panel (~2002-2008), chair
- NSF LRAC award since ~2001; ||-computing since 1987; ~17 M hours this year on local and NSF machines
- U Utah CI Council; Information Technology Council; CHPC

eScience = cyberinfrastructure (???)
“The term ‘e-Science’ denotes the systematic development of research methods that exploit advanced computational thinking.” Professor Malcolm Atkinson, e-Science Envoy.
“Cyberinfrastructure” consists of computing systems, data storage systems, data repositories and advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable breakthroughs not otherwise possible. EDUCAUSE, Campus Cyberinfrastructure workgroup

“If you’re a scientist, talk to a computer scientist about your challenges, and vice versa.” e.g. clustering, data handling, …

How do drugs bind and influence structure (and dynamics)? the tool: biomolecular simulation

The tool: Biomolecular simulation energy vs. sampling

What is bio-molecular simulation?
A “physics”-based atomic potential (the force field) tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …)
Force-field terms: bonds, angles, dihedrals, electrostatics (+/- charges), van der Waals
There are many force fields, each with distinct performance characteristics…
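The five terms named on this slide are commonly combined into an additive potential energy function. The sketch below is only an illustration of that general form, assuming harmonic bonds and angles, a cosine torsion term, Coulomb electrostatics and a Lennard-Jones 12-6 van der Waals term. The function names, the conversion factor and every parameter value are assumptions made for this example and are not taken from the talk or from any specific force field.

```python
import math

# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2). Assumed
# for illustration; individual packages use slightly different values.
COULOMB_FACTOR = 332.06


def bond_energy(k, r, r0):
    """Harmonic bond stretch: k * (r - r0)^2."""
    return k * (r - r0) ** 2


def angle_energy(k, theta, theta0):
    """Harmonic angle bend: k * (theta - theta0)^2, angles in radians."""
    return k * (theta - theta0) ** 2


def dihedral_energy(vn, n, phi, gamma):
    """Periodic torsion: (vn / 2) * (1 + cos(n * phi - gamma))."""
    return 0.5 * vn * (1.0 + math.cos(n * phi - gamma))


def electrostatic_energy(qi, qj, r):
    """Coulomb interaction between point charges qi and qj at distance r."""
    return COULOMB_FACTOR * qi * qj / r


def van_der_waals_energy(epsilon, rmin, r):
    """Lennard-Jones 12-6 term with well depth epsilon and minimum at rmin."""
    ratio6 = (rmin / r) ** 6
    return epsilon * (ratio6 * ratio6 - 2.0 * ratio6)


# Hypothetical parameters for a single atom pair, for illustration only.
print("vdW at the well minimum:", van_der_waals_energy(0.1, 3.5, 3.5))
print("opposite charges at 3 A:", electrostatic_energy(1.0, -1.0, 3.0))
```

The total energy is the sum of such terms over all bonds, angles, dihedrals and non-bonded pairs. Different force fields differ in the parameters and, in places, in the functional form itself.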

What is bio-molecular simulation?
A “physics”-based atomic potential (the force field) tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …)
Codes and methods developed over the past ~40 yrs by various teams including centers, labs and industry
– 80’s: vectorization + early parallel architectures
– 90’s: shared memory and distributed memory parallelized
– 00’s: special purpose hardware and optimized codes
– AMBER, CHARMM, Encad, NAMD, Desmond, GROMOS, Gromacs, LAMMPS, …

- de novo protein folding, structure prediction
- computer aided drug design
- design of novel materials / properties
- multi-scale modeling
- simulation on time scales that are approaching the relevant time scales
CAPTOPRIL: ACE inhibitor (antihypertensive)
VIRACEPT: HIV protease inhibitor (AIDS therapy). Agouron.
CRIXIVAN: HIV protease inhibitor (AIDS therapy). Merck.
VIAGRA: cGMP PD type 5 inhibitor (impotence). Pfizer.
ZOMIG: 5-hydroxytryptamine (serotonin) receptor agonist (migraine). Zeneca.
TEVETEN: angiotensin II receptor antagonist (hypertension)
TRUSOPT: carbonic anhydrase inhibitor (glaucoma)
ARICEPT: AChE inhibitor (Alzheimer's / dementia)
COZAAR: angiotensin II receptor antagonist (hypertension)
NOROXIN: inhibits bacterial DNA synthesis (antibacterial)

general goals of bio-molecular simulation research
- Do the simulations model reality?
- How can we assess & validate the results?
- Can simulations provide predictive insight?
- How can we improve the applied methods?
- How can we facilitate the simulation experiment?
- How can we better disseminate the data?
- How can we use the emerging machines?
computational science (???)

Experience required?
- structural biology
- statistical mechanics
- biophysics / computational chemistry
- pharmacy / organic chemistry
- UNIX / system administration
- coding ability (Fortran90, scripting, …)
- parallel computing
- data handling, analysis, viz
B.A. Chemistry; B.A. Mathematics & Comp. Sci.; PhD Pharm Chem; Programmer / Analyst, 2 yrs; + NSF centers
Still learning… interdisciplinary teams?

The power of the TeraGrid (aka metacenter, PACI, xD: NSF centers)
education / training
– CM-2a, CM-5, MasPar training, ~1989
– Summer Institute in Supercomputing at PSC, 1992
– Scientific Computing Institute at Los Alamos, 1992: vectorization, basic concepts of shared vs. distributed memory
– Heterogeneous computing at PSC, 1994: shared memory + MPI (+ PVM, TCGMSG, …)
– Shared memory and MPI parallelized AMBER released (PSC, SGI)
– AMBER workshops (as teacher), 1996 & 1998
outreach
– center brochures, literature, WWW pages, joint publications
– Computerworld Smithsonian Awards Finalist (with PSC, UCSF, NIEHS)
cycles!!!
– friendly user status, consultants, helpline, porting guides
Allocations: ~100K in 1995, ~1M in 2002, ~10M in 2009, ~14M in 2010, …

$$$:
1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure
1R01-GM079383-01: “AMBER force field consortium”
Research funding focuses on NIH mission (basic science + health relevance)
Curious trends, barriers and limitations in the field…
- Funding is results driven, little reward for software optimization
- NIH does not really fund (or support) supercomputing / CI
- …yet NIH funds the bulk of biomolecular simulation research (?)
- Student PhDs tend to be in “chemistry” (no expertise in computational science)
- Codes are complex, legacy, and evolving…

Curious trends, barriers and limitations in the field…
- NSF does not directly fund most biomolecular simulation research
- few agencies or companies support biosimulation code development
- bulk of cycles in field from NSF centers, then DOE.
– 10% cap on NIH research vs. inter-agency cooperation?
Threats:
- Without NSF cycles and the TeraGrid/xD the field of biomolecular simulation would stagnate.
- …we are spending more and more of our time running simulations, managing workflow, transferring data, i.e. doing computational science
[Pie chart, share by discipline: PHY 18%, AST 14%, CHE 11%, DMR 8%, DMS 0%, BIO 30%, ENG 10%, CIS 3%, GEO 2%, SBE 0%, IND 4%]

bio-molecular simulation at the meta-scale (~1994-1997)
MD simulations ~500 ps – 3 ns
- simulations run for ~6 months, 16-32-way parallel, batch
- < 100 GB data, run remotely, stored and analyzed locally
- analysis is standard (key values vs. time)
- required advances (completed):
  - methods improvement (PME electrostatics)
  - optimized codes for shared memory, MPI, …
  - development of general purpose analysis utilities (“ptraj”)

bio-molecular simulation at the tera+ -scale
tetraloop receptor: 5 simulations @ ~200 ns
cyp-P450 2B4: 8 simulations @ ~150 ns
DNA minor groove binders: 7 drugs, 2 binding modes, 4 sequences @ ~50 ns
- simulations run for ~6 months, 16-1K-way parallel, batch
- ~1-5 TB per set, run remotely, stored and analyzed locally
- analysis has become rate limiting; data too large/slow…

Data is complex: How to simplify? (don’t throw out baby with bathwater) vast time/size scales; granularity scales

…if we know what we want to see, analyzing and visualizing is easy… …and tools are available

force fields vs. sampling
- force fields: we (likely) have systematic problems with structure, or converge to an incorrect structure
- sampling: we (likely) get trapped in metastable conformations
[Figure: energy vs. reaction coordinate]
Computer power? the good, the bad

David E. Shaw: DESRES 16 microseconds / day !!!

Funny things can and do happen… & we’re experiencing serious data overload… 500 nanosecond simulation of a DNA duplex using generalized Born implicit solvation

Some problems (~2000-2008): K+, Cl-, Mg2+ crystal? Phased A-tract burrowing Mg2+ ion?

Joung / Cheatham, JPCB 113, 13279 (2009)

How about long DNA simulation? > 500 ns on DAPI-bound DNA duplexes, Cornell et al. force field.
site    E(complex)  E(DNA+20w)  E(DAPI)   ΔG      ΔΔG*
ATTG    -4085.0     -3915.6     -149.7    -19.7   -2.4
AATT    -4086.4     -3917.9     -149.7    -18.8
ATTG    -4085.7     -3916.4     -149.7    -19.6   +1.0
AATT    -4087.5     -3917.2     -149.7    -20.6
ATTG    -4087.2     -3918.7     -149.7    -18.8   +1.4
AATT    -4092.8     -3922.9     -149.7    -20.2
* Includes entropic differences
J. Amer. Chem. Soc. (2003) Špackova et al. (Cheatham, Sponer)
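The numbers on this slide appear to follow a simple difference: the fourth value in each row equals the complex energy minus the DNA+20w energy minus the DAPI energy. That relationship is inferred from the arithmetic, not stated on the slide, and the name `binding_energy` is a label chosen for this sketch. The check below recomputes the column from the three energies listed for each site.

```python
# Rows as listed on the slide: (site, E_complex, E_DNA_plus_20w, E_DAPI).
ROWS = [
    ("ATTG", -4085.0, -3915.6, -149.7),
    ("AATT", -4086.4, -3917.9, -149.7),
    ("ATTG", -4085.7, -3916.4, -149.7),
    ("AATT", -4087.5, -3917.2, -149.7),
    ("ATTG", -4087.2, -3918.7, -149.7),
    ("AATT", -4092.8, -3922.9, -149.7),
]


def binding_energy(e_complex, e_receptor, e_ligand):
    """Energy of the complex minus the energies of its separated parts."""
    return e_complex - e_receptor - e_ligand


# Recompute the difference column, rounded to the slide's one decimal place.
differences = [round(binding_energy(c, r, l), 1) for _, c, r, l in ROWS]

# Compare each ATTG row with the AATT row that follows it.
relative = [round(differences[i] - differences[i + 1], 1) for i in (0, 2, 4)]

for (site, *_), value in zip(ROWS, differences):
    print(site, value)
print("ATTG minus AATT, per pair:", relative)
```

The plain differences reproduce the +1.0 and +1.4 entries of the last column. For the first pair the plain difference is -0.9 rather than the listed -2.4, which is consistent with the footnote that the starred column includes entropic differences.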

(CC AATT GG) 2 GG at ~350 ns (two separate simulations) …DNA duplex structure goes away and never comes back…

- dynamics / flexibility
- > 1 conformation
- structure is (very) sensitive to the surroundings
- un-validated force fields
- very few drug bound structures…
RNA is more difficult… but also much more interesting!

“Long” MD (~20-100 ns): restraints progressively violated…
STATISTICS d109
DISTANCE between atoms :9@H5 & :7@H1'
AVERAGE: 6.8887 (2.7204 stddev)
INITIAL: 4.2624
FINAL: 6.5966
NOE SERIES: S < 2.9, M < 3.5, w < 5.0, blank otherwise.
|SMMMMWMMWMWW W |
NOE < 4.30 for 21.86% of the time
NOE < 4.80 for 24.83% of the time
6.5 -------------------------------------------------------
%occupied | 0.7 | 13.1 | 9.2 | 6.2 | 10.0 | 60.7 |
[Figure: RNA secondary-structure schematic; labeled residues include A7 and U8]
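The statistics on this slide (average, standard deviation, an S/M/W series and the fraction of time below an NOE bound) can be recomputed from any distance time series. The sketch below is a minimal illustration using the thresholds printed in the output (2.9, 3.5 and 5.0). It is not the analysis program's own implementation, the function names are invented for this example, the use of a population standard deviation is an assumption, and the sample distances are made up.

```python
from statistics import mean, pstdev


def noe_class(distance):
    """Classify one distance as Strong, Medium, Weak or blank."""
    if distance < 2.9:
        return "S"
    if distance < 3.5:
        return "M"
    if distance < 5.0:
        return "W"
    return " "


def noe_series(distances):
    """One character per sample, in time order."""
    return "".join(noe_class(d) for d in distances)


def fraction_below(distances, bound):
    """Fraction of samples whose distance is strictly below the bound."""
    return sum(d < bound for d in distances) / len(distances)


# Hypothetical distances in Angstrom, standing in for a real time series.
samples = [2.5, 3.0, 4.0, 4.5, 6.0, 7.0]

print("average:", mean(samples), "stddev:", pstdev(samples))
print("series: |" + noe_series(samples) + "|")
for bound in (4.30, 4.80):
    print(f"NOE < {bound:.2f} for {100 * fraction_below(samples, bound):.2f}% of the time")
```

A restraint that is satisfied at the start of a run but whose fraction below the bound keeps falling as the run is extended shows the progressive violation this slide describes.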

peta- to exa-scale worries…

Petascale science: scaling
MD codes scale to ~16-256 processors @ > 70% efficiency
► getting to 1000 is do-able (Bob Duke, UNC; Schulten, UIUC; DE Shaw; E. Lindahl)
► getting to 10,000 is hard (PetaApps).
► getting to 100,000: ??? (ensemble methods); not easily with embarrassingly NOT parallel MD
It is hard to ||-ize time
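One way to see why each factor of ten in processor count gets harder is Amdahl's law, used here as a simplified stand-in model; the slide itself does not name it. The law ignores communication and load-balance costs, so real codes typically do worse. The sketch computes the largest serial fraction compatible with 70% efficiency on 256 processors, then shows what the same fraction would give at the larger counts mentioned above. All function names are assumptions for this example.

```python
def amdahl_speedup(serial_fraction, processors):
    """Ideal speedup when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def parallel_efficiency(serial_fraction, processors):
    """Speedup divided by processor count (1.0 means perfect scaling)."""
    return amdahl_speedup(serial_fraction, processors) / processors


def max_serial_fraction(target_efficiency, processors):
    """Largest serial fraction that still reaches the target efficiency."""
    return (1.0 / target_efficiency - 1.0) / (processors - 1)


# Serial fraction implied by 70% efficiency at 256 processors in this model.
serial = max_serial_fraction(0.70, 256)
print(f"serial fraction: {serial:.5f}")

for count in (256, 1000, 10_000, 100_000):
    print(count, f"{parallel_efficiency(serial, count):.3f}")
```

In this idealized model a code that just reaches 70% at 256 processors loses most of its efficiency well before 10,000. That is in line with the slide's point that single long trajectories are hard to spread over very large machines and that ensembles are the more natural route.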

trajin traj.1.gz
trajin traj.2.gz
trajin traj.3.gz
trajout traj.strip
center :1-10 mass origin
image origin center
center :1-20 mass origin
image origin center
rms first mass out rms.dat :1-20
distance d1 out d1.dat :1 :10
grid wat.xplor 100 0.5 100 0.5 100 0.5 :3-8,13-18
strip :WAT
average pdb average.pdb :1-20
…the standard means of analysis is breaking down…
data management & simulation workflow are limiting…
New modes of operation: ENSEMBLES
- replica-exchange
- path integral
- EVB
- ΔG simulations
- NEB / path sampling
- meta-dynamics
More data, more complicated workflow, …
Essentially a set of loosely coupled 16-1K processor jobs
Examples:
two ΔG states, 20 windows = 40 * 20 temperatures = 800 instances
256 frames on a reaction path * 16 beads per particle * … =
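The instance counts in the examples are plain products of the ensemble dimensions. The short sketch below reproduces the first example and turns it into a processor range using the slide's 16 to 1K processors per job. Reading "1K" as 1024 is an assumption, and the second example is left as a partial product because the slide elides its remaining factors.

```python
import math


def instance_count(*dimensions):
    """Number of loosely coupled jobs: the product of the ensemble dimensions."""
    return math.prod(dimensions)


# First example: 2 end states x 20 windows x 20 temperatures.
free_energy_instances = instance_count(2, 20, 20)

# Processor range if every instance runs as a 16-way to 1024-way job
# (1024 is this sketch's reading of "1K").
min_processors = free_energy_instances * 16
max_processors = free_energy_instances * 1024

# Second example, known factors only: 256 path frames x 16 beads per particle.
path_partial_product = instance_count(256, 16)

print(free_energy_instances, min_processors, max_processors)
print("partial product for the path example:", path_partial_product)
```

Each instance writes its own trajectory, so the data volume and the bookkeeping grow with the same product.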

Petascale science: the problem will only get worse!
tetraloop receptor, 5 simulations @ 200 ns: > 1 TB of data
What if we can run 1000x longer? …or 10x bigger for 100x longer? > 1000 TB of data
…factor of 10: OK
…factor of 100: hard
…factor of 1000: ???
…more and more time is spent moving data / managing simulations; less time spent doing science…
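The jump from 1 TB to 1000 TB follows if trajectory size grows linearly with both system size and simulation length at a fixed output frequency, which is the assumption made in this sketch. The transfer-time estimate additionally assumes a sustained 1 Gbit/s link, a figure chosen only for illustration and not taken from the talk. Function names are hypothetical.

```python
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 86_400


def scaled_volume_tb(base_tb, size_factor, length_factor):
    """Data volume assuming linear growth in atoms and in saved frames."""
    return base_tb * size_factor * length_factor


def transfer_days(volume_tb, gigabits_per_second):
    """Days needed to move the data at a sustained network rate."""
    bits = volume_tb * BYTES_PER_TB * 8
    return bits / (gigabits_per_second * 1e9) / SECONDS_PER_DAY


# Both scenarios from the slide, starting from the ~1 TB tetraloop data set.
longer = scaled_volume_tb(1, size_factor=1, length_factor=1000)
bigger = scaled_volume_tb(1, size_factor=10, length_factor=100)

print(longer, bigger)
for volume in (1, 10, 100, 1000):
    print(volume, "TB:", round(transfer_days(volume, 1.0), 2), "days at 1 Gbit/s")
```

Under these assumptions a 1000 TB data set needs roughly three months of continuous transfer, which is one reason to avoid moving the data at all.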

Petascale science: the problem will only get worse!
Solutions?
- Do not move the data (?)
- Tiered resources
- Persistent storage
- Re-running the simulations
Analysis “on the fly…” [ & more coarse-grained sampling ] + workflow tools for ensembles
…what will we miss? Can we only get low hanging fruit?
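Analysis "on the fly" means updating results as each frame is produced, so that raw frames need not be stored or moved. A minimal sketch of the idea is a running mean and variance using Welford's online update, a standard technique chosen here for illustration and not one the talk names. The class name and the sample values are hypothetical.

```python
class RunningStats:
    """Mean and variance updated one value at a time, without storing values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical per-frame values (e.g. a distance) arriving during a run.
stats = RunningStats()
for frame_value in (1.0, 2.0, 3.0, 4.0):
    stats.update(frame_value)

print(stats.count, stats.mean, stats.variance)
```

The trade-off is the one the slide raises: only quantities chosen in advance are kept, so anything not anticipated is missed unless the simulation is re-run.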

Petascale science: Worries as we move forward…
Hindrances:
- Codes have become “simpler” and will need to be restructured: intra-core vs. intra-node vs. inter-node vs. cpu type
- We want to retain high precision / accuracy.
- We want to be able to enable new methods (with ease).
( Force fields are not yet up to the challenge!!! )

What we need (data/workflow-centric) is:
…a means to speed up & enable science…
…a means to interact with our simulations: “steer”, inspect, share, search, understand, expose (hidden correlations, meaning, data)
…a means to manage large simulation workflows… disseminate, enable re-use
How do we make TBs of raw data available?
- remote references to data
- partial analysis, on the fly analysis
- history, memory, or provenance
- standards (?)
- annotation
- automation – workflow!

My world reinforces Seidel’s CI crises. We need:
- Educated people / teams (multidisciplinary, experts)
- Software / middleware (workflow, provenance, data handling)
- Software – code optimization / parallelization / extensions
- Ease of use
- Means to analyze data, distribute data, preserve/archive data…
- More cycles, more disk space, …
- More science, less computational science

Hepatitis C virus IRES IRES = internal ribosome entry site (translation initiation in middle of mRNA)

Why is failure important to learn about?
These methods are in wide use worldwide:
- CADD
- Structure Prediction
- Mechanisms
- Molecular association
Most people do not have 15M hour allocations
Data from failure can be reused!
~500 active NIH grants with “molecular dynamics” in abstract!

office CHPC home NIH R01-GM081411-01A1 NIH R01-GM079383-01A1 NSF TG-MCA01S027
