Scalable systems for reservoir modeling on modern hardware platforms. Dmitry Eydinov, SPE London, November 24th, 2015.


Demands in Simulations
Field development relies more and more on static and dynamic modeling of reservoirs, which has come a long way from simple material-balance estimators to full-physics numerical simulators. Over time the models have become more demanding and more complex: rock properties, fluid and reservoir description, well models, surface networks, compositional and thermal effects, EOR processes, etc. Grid dimensions are chosen based on available resources and project time frames, and proper uncertainty analysis is often skipped due to limited time.

Grid Resolution Effects
(Comparison of a fine grid, 1m x 50m x 0.7m, and a coarse grid, 2m x 50m x 0.7m.)

Moore’s Law “The number of transistors in a dense integrated circuit doubles approximately every two years” - Gordon Moore, co-founder of Intel, 1965

Evolution of microprocessors
Only the number of transistors/cores continues to rise!

First Serial Multicore CPUs
In old clusters all computational cores are isolated by distributed memory (MPI required), and most conventional algorithms are designed around this paradigm. In shared-memory systems all cores communicate directly, which is significantly faster than communication between cluster nodes. Simulation software has to take this into account to maximize parallel performance.
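As a minimal illustration of the shared-memory side of this contrast (not from the presentation), the OpenMP sketch below lets every core of one node update a common array directly, with no message passing; the array name and the update itself are placeholders.

/* Minimal sketch (illustrative only): on a shared-memory node, OpenMP
 * threads read and write one common array directly, with no explicit
 * message passing between cores. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double pressure[N];          /* hypothetical cell-pressure array */
    double sum = 0.0;

    /* Every thread works on its own slice of the same array in memory. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        pressure[i] = 1.0 + 0.001 * i;  /* placeholder update */
        sum += pressure[i];
    }

    printf("threads=%d, checksum=%f\n", omp_get_max_threads(), sum);
    return 0;
}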

HPC for Numerical Modeling
Climate modeling and weather forecasting, digital content, financial analysis, space technologies, medicine, technical design: all industries run massive high-performance computing simulations on a daily basis.

In the meantime, in reservoir simulation…

Desktops and Workstations
Shared memory systems: fast interactions between the cores, no need to introduce grid domains, and the system of equations can be solved directly on the matrix level.
(Diagram: dual-socket Intel Xeon Processor E5 v3 system, up to 18 cores and up to 30MB shared cache per CPU, with 4 channels of DDR3 memory per socket.)

Desktops and Workstations
The software: for maximum performance the following hardware features are used:
- Shared memory: matrix blocks are selected automatically on the matrix level
- Non-Uniform Memory Access: memory is allocated dynamically through NUMA
- Hyperthreading: system threads are accessed directly
- Fast CPU cache: big enough to fit matrix blocks
- All parts of the code are parallel, not just the linear solver (see the Amdahl's law note below)
- Special compiler settings
- "Bandwidth machine": up to 51GB/s (~10 times the Infiniband speed)
(Diagram: DDR3 memory channels, QPI link, NUMA layout.)
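The claim that all parts of the code must be parallel can be quantified with Amdahl's law. This is a standard result, not stated on the slides, and the fractions used below are illustrative assumptions rather than figures from the presentation.

% Amdahl's law: S(p) is the speed-up on p cores when a fraction f of the runtime is parallel.
S(p) = \frac{1}{(1-f) + f/p}
% If only the linear solver is parallel and it accounts for f = 0.95 of the runtime,
% then on p = 24 cores:  S = 1 / (0.05 + 0.95/24) \approx 11.2
% If "all parts of the code" are parallel so that f = 0.99:
%                        S = 1 / (0.01 + 0.99/24) \approx 19.5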

High-end Desktops and Workstations
(Chart: speed-up vs. single core against number of threads for the following machines.)
2011: Dual Xeon X5650, (2x6) 12 cores, 2.66GHz, 3 channels of DDR3 memory (e.g. HP Z800)
2012: Dual Xeon E5-2680, (2x8) 16 cores, 2.7GHz, 4 channels of DDR3 memory (e.g. HP Z820)
2013: Dual Xeon E5-2697v2, (2x12) 24 cores, 2.7GHz, 4 channels of DDR3 memory (e.g. HP Z820)
2014: Dual Xeon E5-2697v3, (2x14) 28 cores, 2.6GHz, 4 channels of DDR4 memory (e.g. HP Z840)

Modern HPC clusters are not as complex as space shuttles anymore:
- 10-core Xeon E5 v2, 2.8GHz
- 8 dual-CPU nodes with 160 cores in total (= 8 workstations connected with Infiniband 56Gb/s)
- 1.024TB of DDR3 1866MHz memory
- ~ $75K
Models with up to 300 million active grid blocks; parallel speed-up ≈ … times.

Hybrid algorithm. Removing the bottlenecks.
(Diagram: cluster node with 2 CPUs; the solver uses MPI across the cluster network at ~5GB/s and OS threads over the NUMA shared memory at ~50GB/s.)
- The simulator solver software integrates both MPI and thread system calls
- Node level: parallelization between CPU cores is done at the level of the solver matrix using OS threads
- As a result, the number of MPI processes is limited to the number of cluster nodes, not the total number of cores
This removes one of the major performance bottlenecks: the network throughput limit! A minimal sketch of the hybrid pattern is shown below.
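The sketch below illustrates the pattern described on this slide: one MPI rank per cluster node, with OpenMP threads across the cores inside the node. It is an assumption-laden illustration, not the presenter's code; local_solver_sweep() is a hypothetical stand-in for the node-local work on the solver matrix.

/* Minimal hybrid MPI + OpenMP sketch: one MPI rank per node, OS threads
 * (OpenMP) across the cores inside the node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static void local_solver_sweep(int rank)
{
    /* All cores of the node share the matrix in memory. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* ... thread-parallel work on the node-local matrix block ... */
        if (tid == 0)
            printf("rank %d running %d threads\n", rank, omp_get_num_threads());
    }
}

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request an MPI level that tolerates threads inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* nranks equals the number of cluster nodes, not the number of cores. */
    local_solver_sweep(rank);

    /* Boundary data is exchanged only between nodes (few, larger messages). */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}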

Model grid domains
Suppose we have:
- Model: 3-phase with 2.5 million active grid cells
- Cluster: 10 nodes x 20 cores = 200 cores in total
Conventional MPI: 200 grid domains exchanging boundary conditions.
Multilevel hybrid method: 10 grid domains exchanging boundary conditions.
A rough estimate of the resulting communication saving is given below.
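The estimate below is a standard domain-decomposition argument, not taken from the slides, and assumes roughly cubic domains over a fixed grid.

% Splitting a grid of volume V into N roughly cubic domains, each domain has a
% halo surface of order (V/N)^{2/3}, so the total boundary data exchanged is
T_{\text{halo}}(N) \;\sim\; N \left(\frac{V}{N}\right)^{2/3} \;=\; V^{2/3}\,N^{1/3}.
% Going from N = 200 MPI domains to N = 10 therefore shrinks the exchanged
% boundary data by roughly (200/10)^{1/3} \approx 2.7; the boundaries that
% disappear are the intra-node ones, replaced by direct shared-memory access.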

Memory footprint for an 8-core/node cluster
(Chart: total memory used, GB, vs. number of nodes.) The hybrid method needs 5 times less memory for 64 nodes.

Cluster Parallel Scalability
(Chart: acceleration vs. number of cores for the Xeon X5650 and Xeon E5-2680v2 clusters.)
Old cluster: 20 dual-CPU (12-core) nodes, 40 Xeon X5650, 240 cores, 24GB DDR3 1333MHz, Infiniband 40Gb/s
New cluster: 8 dual-CPU (20-core) nodes, 16 Xeon E5-2680v2, 160 cores, 128GB DDR3 1866MHz, Infiniband 56Gb/s

Testing the limits
Top 20 cluster:
- 512 nodes used
- Dual Xeon 5570
- 4096 cores
- DDR3 1333MHz
Model: 21.8 million active blocks, 39 wells, "black oil".
(Chart: acceleration vs. number of cores.) Acceleration of 1328 times: from 2.5 weeks to 19 minutes.
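A quick arithmetic check of the figures quoted above, using only the numbers on the slide:

% 2.5 weeks expressed in minutes, compared with the quoted 19-minute run:
2.5\ \text{weeks} = 2.5 \times 7 \times 24 \times 60\ \text{min} = 25\,200\ \text{min},
\qquad \frac{25\,200}{19} \approx 1326 \approx 1328.
% Implied parallel efficiency on 4096 cores:
E = \frac{S}{p} = \frac{1328}{4096} \approx 0.32 \;(\text{about } 32\%).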

Easy to install – easy to use
(Slide illustrates power consumption: a Xeon E5-2680v2 cluster at 3.2kW, compared with household appliances such as a Bosch TWK kettle and a Tefal FV iron, and with an older Xeon X-series cluster.)
In-house clusters:
- Can be installed in a regular office space
- Take only 4-6 weeks to build
- Need an air-conditioned room and a LAN connection
- Significantly more economical than 5-10 years ago

In-house Cluster Setup
(Diagram: users connect through a GUI over the network to a head node running a dispatcher, which controls the cluster nodes and shared storage over the cluster network.)

User Interface
- Job queue management (start, stop, results view)
- Full graphical simulation results monitoring at runtime (2D, 3D, wells, perforations, 3D streamlines)

Project Workload
Typically strongly non-uniform due to decision-making cycles in the companies; the peaks require significant investment in computational resources.

Amazon Cloud
(Map of cloud regions.) Thousands of CPUs/cluster nodes can be accessed in the cloud for a very reasonable price.

How does it work in the cloud?
- Users choose how many nodes/cores they would like to use
- Software gets installed automatically in several minutes
- Data is uploaded once in a packed format; the models can then be changed directly in the cloud storage
- All files in the cloud are encrypted to ensure data security
- Simulation results are visualized directly on a remote workstation connected to the cluster nodes
- When the simulations are complete, the data can be deleted or left in the cloud storage
- Users are charged just for the time they access the technology

Case Study #1. Giant Field.
- Three-phase black-oil model
- Complex geology with an active gas cap
- Production history: 45 years
- Number of producers and injectors: 14,000 (vertical, inclined, horizontal)
- 2.7 billion tons of oil produced
- 8+ reservoir volumes injected

Case Study #1. Giant Field.
To select the optimal spatial grid resolution, 4 grids have been generated:
- Original model, 150m x 150m: 7 mil. blocks (4.5 mil. active)
- Lateral grid refinement (block sizes in XY reduced 3 times), 50m x 50m: 70 mil. blocks (40 mil. active)
- Vertical grid refinement (block sizes in Z reduced 4 times), 50m x 50m: 280 mil. blocks (160 mil. active)
- Vertical grid refinement (block sizes in Z reduced 10 times), 50m x 50m: 700 mil. blocks (404 mil. active)

Case Study #1. Giant Field.
The most complex cases were run on a massive cluster:
- 64 nodes, 2 Xeon E5-2680v2 2.8GHz CPUs per node
- 4 channels of DDR3 1866MHz with 128GB per node
- Local network: FDR Infiniband 56Gb/s
- Total: 1280 CPU cores, 8.2TB RAM, 200TB of disk space
Run times on the 1280-core cluster (active blocks / well perforations / total memory / total CPU time):
- 40 mil. active blocks, 0.13 mil. perforations, 120 GB: 5 hours 30 min
- 162 mil. active blocks, 0.51 mil. perforations, 561 GB: 54 hours 04 min
- 404 mil. active blocks, 1.28 mil. perforations, 1.29 TB: 490 hours
Taking into account the number of active grid blocks, well perforations, and history, this is one of the world's most complex dynamic models.

Case Study #1. Giant Field.
- The distributions of reservoir pressure in grid blocks located directly beneath the gas cap, calculated at the last historic time step, are compared
- A systematic shift of the reservoir pressure distributions for grid blocks under the gas cap is observed
(Charts: relative frequency vs. reservoir pressure in bars, showing the shift; bottom-hole pressure at producers for the 40 mil. and 162 mil. models.)
To produce the same historic amount of liquid, the model with 162 mil. active grid blocks needs more intense pumping!

Case Study #1. Giant Field.
- The presence of additional sub-layers in the model with 162 mil. active grid blocks causes reservoir liquids to be produced first from high-permeability layers, with typically higher bottom-hole pressure at producers compared to the 40 mil. model
- Then, as more liquids are extracted, production starts to affect layers with lower permeabilities and thus requires reduced bottom-hole pressure at the production wells
(Charts: comparison of average pressure dynamics for the 40 mil. and 162 mil. models, shown separately for the oil, water and gas phases.)

Case Study #2*
Key objective: maximise the value of the asset to the business. Optimise the development plan accounting for uncertainty. Target the P70 value of the NPV.
*From "Computer Optimisation of Development Plans in the Presence of Uncertainty" by Dr Jonathan Carter, Head of Technology and Innovation Center for Exploration and Production, E.ON

Case Study #2. Conclusions*
- We used about 34,000 simulations over a three-week period, on 31 nodes each with 16 cores
- We estimate that another well-known simulator would have needed almost a whole year to complete the same task
- The final solution obtained is only slightly better than the engineer-designed case (about 2%)
- The total effort was much reduced: it is easier to set up 31 models to cover the uncertainty than to hold meetings about what the reference case should look like; most of the strain was taken by the computer, leaving the engineer free to do other things
- The final optimised well placements have some interesting features that challenge the normal design process
*From "Computer Optimisation of Development Plans in the Presence of Uncertainty" by Dr Jonathan Carter, Head of Technology and Innovation Center for Exploration and Production, E.ON

Case Study #3
Key objective: probabilistic production forecast for a big field, accounting for uncertainty, based on the defined development scenario for a 25-year period.
(Diagram: integrated uncertainty study combining geological and dynamic variables into a probabilistic forecast with account for uncertainty.)

Case Study #3
- 3 different structure models
- 300 geological models (100 realizations for each of the structure models)
- 8100 simulation models
- 83 history-matched solutions
- Probabilistic forecast (P10, P50, P90)

Case Study #3
HPC cluster (96 nodes, 1920 cores): a great assistant for developing our good ideas. 8100 simulation models for the history-matching cycles in less than two days! Two weeks for the whole scope of work.

Conclusions
The standard bottlenecks for parallel performance in reservoir simulation are mostly on the software side. They can be removed by applying modern software products that properly handle modern hardware architectures. Today, hardware and software technology makes it possible to reach parallel acceleration rates of 100, 300 and even higher (over 1300 times in the cluster example above). Technically, the industry has everything it needs to move towards:
o finer geological grids
o uncertainty assessment workflows
without significant growth in project time frames and without dramatic investment in computational resources.

Thank you!

Simulations on GPU
(Chart: deceleration with respect to 2 Xeon E5 CPUs, per matrix number.)
The matrices are sparse, with lots of empty space, and it is therefore rather difficult to keep all the GPU's compute units busy… The memory and I/O of the Xeon E5 seem to handle it much better. The CUDA Tesla did well on only one matrix!
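To make the sparsity point concrete, the sketch below shows a CSR sparse matrix-vector product, the core kernel of iterative solvers. It is an illustrative example, not the presenter's code: the indirect access x[col[j]] is irregular and memory-bandwidth bound, which is why large CPU caches and fast memory channels can outperform a GPU unless the GPU kernel is carefully tuned to the sparsity pattern.

/* Minimal CSR sparse matrix-vector product, y = A*x. Performance is
 * dominated by the irregular, cache-unfriendly read of x[col[j]] rather
 * than by raw floating-point throughput. */
#include <stddef.h>

void csr_spmv(size_t nrows,
              const size_t *row_ptr,   /* nrows+1 offsets into col/val */
              const size_t *col,       /* column index of each nonzero */
              const double *val,       /* value of each nonzero        */
              const double *x,
              double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += val[j] * x[col[j]];   /* irregular, bandwidth-bound read */
        y[i] = acc;
    }
}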