Reservoir characterization workflow optimization on AWS

By Amazon Web Services

January 28, 2021

Abstract

Reservoir characterization in the Oil and Gas industry is the process of building a mathematical model of the earth’s subsurface. The model is used to develop a better understanding of the subsurface, improve production profitability, and manage development and exploration risks. Applying a sophisticated geostatistical Markov Chain Monte Carlo algorithm allows for the integration of disparate sources of geological and geophysical data, which is critical for successful reservoir characterization. The statistical nature of the algorithm requires generating many independent realizations from the geostatistical model. These realizations can be computed in parallel, but often require an extensive amount of compute and storage resources, traditionally resulting in a substantial upfront capital expense.

The AWS cloud offers a viable, cost-efficient option for deploying a massively scalable HPC architecture.

In this article, we discuss how geoscientists benefit from cloud elasticity to optimize their workflows when working with geostatistical models of increasing size and complexity.

Introduction

Reservoir characterization in the Oil and Gas industry is the process of building a mathematical model of the earth’s subsurface. This model is used to develop a better understanding of the subsurface, improve production profitability, and manage development and exploration risks. A reservoir characterization that achieves the study objectives is produced by a confluence of factors, all of which rely on sophisticated modeling techniques. The modeling process requires an integrated approach that incorporates multi-disciplinary empirical observations of a geological and geophysical nature. These observations typically come with a degree of uncertainty caused by measurement noise, poor data quality, data scarcity, and many other factors that complicate empirical data collection in the field.

Geoscientists recognize the importance of uncertainty estimation and quantification for mitigating reservoir exploration and development risks. For this reason, geostatistical methods based on Bayesian inference are particularly suitable for reservoir characterization. A geostatistical model of the earth’s subsurface is probabilistic in nature and is represented by a probability distribution. The practitioner studies the distribution by obtaining realizations from the model, a procedure conceptually similar to tossing a coin multiple times to estimate its fairness. While conceptually simple, obtaining a realization from a geostatistical model is far from trivial. The procedure may call for developing a Markov Chain Monte Carlo (MCMC) algorithm[1], which invariably raises the topic of a significant investment in computing power if the results are to be delivered on time. Figure 1 shows examples of highly detailed geostatistical models.

Figure 1: Highly detailed geostatistical models honoring Oil Water Contact: (a) Facies – Green: Shale; Blue: Brine Sand; Yellow: Low Quality Oil Sand; Pink: Mid Quality Oil Sand; Brown: High Quality Oil Sand. (b) Porosity – Gold: 0.28; Blue: 0.01. (c) Volume of Clay – Gold: 0.0; Blue: 0.7. (d) Water Saturation – Gold: 0.0; Black: 1.0.
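
Returning to the coin-tossing analogy, the short Python sketch below (purely illustrative, not part of the production workflow) estimates the fairness of a coin from independent tosses, just as a geostatistical study estimates subsurface properties and their uncertainty from independent realizations; the spread of the estimate narrows as more realizations are drawn.

    import numpy as np

    def estimate_fairness(true_p: float, n_realizations: int, seed: int = 0) -> tuple:
        """Estimate the probability of heads from independent coin tosses.

        Each toss is an independent realization from the underlying model;
        the estimate and its uncertainty are derived from the ensemble.
        """
        rng = np.random.default_rng(seed)
        tosses = rng.random(n_realizations) < true_p            # one realization per toss
        p_hat = tosses.mean()                                   # point estimate
        std_err = tosses.std(ddof=1) / np.sqrt(n_realizations)  # uncertainty of the estimate
        return p_hat, std_err

    for n in (10, 100, 10_000):
        p_hat, std_err = estimate_fairness(true_p=0.5, n_realizations=n)
        print(f"{n:>6} realizations: p_hat = {p_hat:.3f} +/- {std_err:.3f}")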

Computational Challenges

Generating a geostatistical realization is a very compute-intensive process. To begin with, the dimensionality of the model is a challenge. Geoscientists work with datasets that include billions of empirical observations, and the amount of input data is only expected to grow as data acquisition and storage technologies advance. In addition, there is a clear trend toward models of increasing structural complexity, and the requirement to produce predictive subsurface reservoir characterizations at higher levels of resolution is also on the rise. To keep pace with these trends and influence business outcomes in a timely fashion, geoscientists need access to virtually unlimited compute and storage infrastructure on a pay-as-you-go model.

Another computational challenge is related to storage. A geostatistical study relies on multiple realizations to infer statistically meaningful information. The realizations are independent of each other, thereby making the problem inherently scalable. At the same time, multiple realizations are also the reason why a geostatistical study produces an order of magnitude more output data than input data. Therefore, parallel computations easily create a bottleneck in the underlying storage solution. Parallel file systems can alleviate this challenge, but they require additional overhead to install and manage locally. There is yet another storage-related factor that comes into play, namely a paradigm shift often referred to as “bringing compute to the data”. Cloud-based storage solutions have progressively gained market share due to their advantageous price, performance, and security characteristics. One important implication of this trend is that the accompanying compute infrastructure needs to migrate to the cloud along with the data.

AWS Elastic Services for HPC

AWS offers elastic and scalable cloud infrastructure to run HPC applications. With virtually unlimited capacity, engineers, researchers, and HPC system owners can innovate beyond the limitations of on-premises HPC infrastructure. AWS HPC services span compute (Amazon EC2 – Elastic Compute Cloud), storage (Amazon S3, Amazon FSx for Lustre), networking (Elastic Fabric Adapter), orchestration (AWS Batch), and visualization (NICE DCV). AWS Batch[2], Amazon FSx for Lustre[3], and Amazon S3[4] are the building blocks for even the most demanding HPC applications, and the services are readily available on a pay-as-you-go basis with no long-term commitments.

Our deployment architecture follows the AWS HPC reference and production architectures, which have also proved effective at solving compute-intensive Machine Learning (ML) problems (see Figure 2).

Figure 2: The solution architecture

Next, we would like to highlight factors that are key to the robustness and effectiveness of our solution.

Automation and Programmability: HPC as Code

We use the AWS SDK for Python[5] to manage cloud resources. Automation of cloud deployment operations significantly improves the end-user experience. With programmatically implemented workflows, we strive to minimize, if not eliminate, manual user interaction with the AWS Management Console or individual services.

Automation is also critical for making the solution cost-effective. It starts with optimal resource selection. For example, the geoscientist doesn’t have to decide upfront on the parameters of a batch compute environment. The software configures batch jobs on the fly with CPU and RAM requirements based on the estimated model complexity (which changes with the simulation scenario), and the Batch compute environment uses the job parameters to provision EC2 instances accordingly. For these workloads, which require a higher ratio of memory to vCPUs, the M5 instance family was chosen to achieve the best price performance.
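
As a minimal sketch of this on-the-fly configuration, the Boto3 snippet below submits a batch job whose vCPU and memory requirements are derived from an estimated model complexity. The job queue and job definition names, as well as the sizing heuristic, are hypothetical placeholders rather than CGG’s actual values.

    import boto3

    batch = boto3.client("batch")

    def submit_realization_job(model_cells: int, realization_id: int) -> str:
        """Submit one realization as an AWS Batch job sized from model complexity."""
        # Hypothetical sizing heuristic: scale vCPUs and RAM with the cell count.
        vcpus = min(48, max(4, model_cells // 50_000_000 * 4))
        memory_mib = vcpus * 4 * 1024  # the M5 family offers roughly 4 GiB per vCPU

        response = batch.submit_job(
            jobName=f"geostat-realization-{realization_id}",
            jobQueue="geostat-job-queue",            # hypothetical queue name
            jobDefinition="geostat-job-definition",  # hypothetical job definition
            containerOverrides={
                "resourceRequirements": [
                    {"type": "VCPU", "value": str(vcpus)},
                    {"type": "MEMORY", "value": str(memory_mib)},
                ]
            },
        )
        return response["jobId"]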

Without going into the mathematical details, we mention a few factors that influence multi-threaded scalability. First, the algorithm includes several mathematical constraints that guarantee geostatistical continuity of the computed properties. Second, the algorithm is inherently multiscale: it iterates over the data at different levels of resolution. Finally, the algorithm applies domain decomposition, when needed, to guarantee that the calculation fits into RAM.

As a consequence, the multi-threaded component scales well when the computational domain is sufficiently large and domain decomposition is avoided. Hence the idea to automate the procedure by programmatically choosing an EC2 instance type with a sufficiently large amount of RAM.
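
The selection logic can be as simple as the following sketch, which picks the smallest M5 instance size whose memory exceeds the estimated footprint; the sizing table reflects published M5 specifications, while the safety margin is an assumption for illustration.

    # Published M5 sizes: (instance type, vCPUs, memory in GiB)
    M5_SIZES = [
        ("m5.2xlarge", 8, 32),
        ("m5.4xlarge", 16, 64),
        ("m5.8xlarge", 32, 128),
        ("m5.12xlarge", 48, 192),
        ("m5.24xlarge", 96, 384),
    ]

    def choose_instance_type(estimated_ram_gib: float, margin: float = 1.2) -> str:
        """Return the smallest M5 instance that fits the model in RAM,
        so that domain decomposition can be avoided."""
        required = estimated_ram_gib * margin  # illustrative safety margin
        for instance_type, _, memory_gib in M5_SIZES:
            if memory_gib >= required:
                return instance_type
        # Fall back to the largest size; the solver would then apply
        # domain decomposition to fit the calculation into RAM.
        return M5_SIZES[-1][0]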

As a side note for CGG’s workload, we noticed performance degradation when using non-dedicated compute nodes for our jobs.

The “scale on demand” principle also applies to the selection of the storage option. Figure 3 shows the results of a benchmark test conducted with a geostatistical model of medium size and complexity. The horizontal axis represents the number of realizations executed in parallel; the vertical axis is the benchmark run time in seconds. The benchmark compares two storage options: Amazon Elastic File System (EFS)[6], shown in blue, and FSx for Lustre (Scratch SSD, 3,600 GB), shown in orange. The important observation is the scalability trend of the benchmark run time: in this test, EFS fails to provide enough I/O throughput as the number of parallel computations increases. This is not to diminish the value of EFS as a storage solution. The point is that EFS turned out to be a suboptimal choice for this particular HPC workload, even though it is designed to provide massively parallel access to EC2 instances.

Figure 3: Scalability of storage options for a model of medium size and complexity

Figure 4 shows the results of the same benchmark with a larger model. Here, we compare scalability across FSx for Lustre file systems of different capacities and throughputs. FSx for Lustre provides scale-out performance that increases linearly with a file system’s size. We would like to emphasize the importance of choosing a scalable storage solution for an I/O-intensive HPC workload. A serious I/O bottleneck may significantly degrade run-time performance, which invariably leads to a substantially higher computational cost: the user ends up paying for idle CPU time while waiting on the I/O to complete. (As a point of reference, 30 m5.12xlarge instances provide 1,440 vCPUs.)

Figure 4: Scalability of storage options for a model of large size and complexity
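
To put the idle-CPU argument into rough numbers (the On-Demand rate below is an assumed, region-dependent figure used purely for illustration): if I/O stalls stretch a run on 30 m5.12xlarge instances by two hours, the idle capacity alone adds a measurable amount to every run, and this accumulates quickly across repeated studies.

    INSTANCES = 30
    ON_DEMAND_RATE = 2.30  # assumed USD per m5.12xlarge hour; varies by region
    EXTRA_HOURS = 2.0      # hypothetical run-time inflation caused by I/O waits

    wasted_cost = INSTANCES * ON_DEMAND_RATE * EXTRA_HOURS
    print(f"Idle-CPU cost of the I/O bottleneck: ${wasted_cost:,.2f}")  # $138.00 per run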

Another key to managing pay-per-use cost is automation of the cloud resource lifecycle. Instead of asking the user to manage resources manually, we destroy each resource automatically when it is no longer needed. A cloud deployment execution is implemented as a state machine (refer to Figure 5) written in the Amazon States Language (ASL)[7]. The state machine and its components, such as Lambda functions and execution policies, form a CloudFormation stack[8]. The stack is created programmatically when the user starts a deployment on the client machine. Resource management via CloudFormation is key to provisioning HPC infrastructure as code.
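
A minimal sketch of creating the deployment stack from the client machine might look like the following; the template path, stack naming, and parameters are hypothetical, and the real deployment passes its own template describing the state machine, Lambda functions, and execution policies.

    import boto3

    cloudformation = boto3.client("cloudformation")

    def create_deployment_stack(deployment_id: str, template_path: str) -> str:
        """Create the per-deployment CloudFormation stack programmatically."""
        with open(template_path) as f:
            template_body = f.read()

        response = cloudformation.create_stack(
            StackName=f"geostat-deployment-{deployment_id}",  # hypothetical naming scheme
            TemplateBody=template_body,
            Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles for Lambda/Step Functions
        )

        # Block until the stack (state machine, Lambdas, policies) is ready.
        waiter = cloudformation.get_waiter("stack_create_complete")
        waiter.wait(StackName=response["StackId"])
        return response["StackId"]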

The AWS Step Functions[9] service manages executions of the state machine. A state machine can be shared across simultaneous deployments by running parallel executions. Our state machine execution is idempotent, which makes it possible to restart it. With potentially long execution times, idempotence is critical for working around the state machine execution quotas[10].
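
As a minimal illustration (the ARN, naming scheme, and input structure are placeholders, not CGG’s actual interface), a deployment client might start an execution of the shared state machine as follows; because the workflow itself is idempotent, a fresh execution can safely be started to resume a deployment that failed or hit an execution quota.

    import json
    import time
    import boto3

    stepfunctions = boto3.client("stepfunctions")

    def start_deployment(state_machine_arn: str, deployment_id: str, model_config: dict) -> str:
        """Start an execution of the shared deployment state machine.

        Execution names must be unique, so an attempt timestamp is appended;
        the idempotent workflow lets a new execution resume a previous deployment.
        """
        response = stepfunctions.start_execution(
            stateMachineArn=state_machine_arn,
            name=f"geostat-{deployment_id}-{int(time.time())}",
            input=json.dumps({"deploymentId": deployment_id, "model": model_config}),
        )
        return response["executionArn"]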

AWS Step Functions simplifies our deployment architecture and contributes significantly to the robustness of its operation. First, the service reduces the operational burden by being serverless; as a general rule, we want native cloud services to manage the infrastructure for us. Second, Step Functions makes it easy to orchestrate a multi-stage workflow. Our geostatistical computation includes three major stages: 1) a preparation stage that performs miscellaneous model initialization tasks, 2) a core computational stage that is the most time and compute intensive, and 3) a post-processing stage that computes multi-realization statistics and quality-control metrics. The stages are executed sequentially, and each stage requires a separate AWS Batch configuration. In addition, a data repository task must be executed after each stage, which is critically important for persisting the results to permanent S3 storage. Finally, automatic destruction of the FSx for Lustre file system after the stages complete is also crucial for managing pay-per-use cost. Parts of the flow that involve operations with the file system need to be particularly robust in the presence of run-time errors in other tasks; we rely on the built-in error handling capabilities of AWS Step Functions to guarantee robustness of the state machine execution in that respect.
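
As an illustration of the cleanup step (not necessarily CGG’s actual implementation), a Lambda task in the state machine could delete the FSx for Lustre file system once the post-processing stage and its data repository task have completed:

    import boto3

    fsx = boto3.client("fsx")

    def lambda_handler(event, context):
        """State machine task: tear down the FSx for Lustre file system
        after results have been exported to S3."""
        file_system_id = event["FileSystemId"]  # passed along by earlier states
        fsx.delete_file_system(FileSystemId=file_system_id)
        return {"FileSystemId": file_system_id, "Status": "DELETING"}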

 

Figure 5: The deployment state machine graph

Fault-tolerant execution is also key to cost-effectiveness. The geostatistical computation may take tens of hours depending on the model complexity and the number of realizations. We implement a batch job as a computational graph of subtasks. A subtask introduces a logical place for a “checkpoint” that captures and stores intermediate computation results. This mechanism makes it possible to restart a batch job automatically from the most recently saved checkpoint, and it also opens possibilities for further cost reduction by using Amazon EC2 Spot capacity. EC2 Spot provides access to spare compute capacity at up to a 90% discount compared to On-Demand pricing. This feature works in conjunction with the AWS Batch job retry functionality. Figure 6 is a conceptual illustration in which the second job attempt restarts a failed job with four subtasks from checkpoint number 3.

Figure 6: Fault-tolerant execution
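
A simplified sketch of the checkpoint logic is shown below; the S3 key layout and subtask interface are hypothetical. Combined with an AWS Batch retry strategy on job submission (for example, retryStrategy with several attempts), a retried attempt simply skips subtasks whose checkpoints already exist.

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "geostat-results"  # hypothetical S3 bucket

    def checkpoint_key(job_id: str, subtask: int) -> str:
        return f"checkpoints/{job_id}/subtask-{subtask}.done"

    def checkpoint_exists(job_id: str, subtask: int) -> bool:
        try:
            s3.head_object(Bucket=BUCKET, Key=checkpoint_key(job_id, subtask))
            return True
        except ClientError:
            return False

    def run_job(job_id: str, subtasks: list) -> None:
        """Run the computational graph, skipping subtasks already checkpointed."""
        for i, subtask in enumerate(subtasks):
            if checkpoint_exists(job_id, i):
                continue                  # a previous attempt finished this subtask
            subtask()                     # hypothetical callable doing the work
            s3.put_object(Bucket=BUCKET, Key=checkpoint_key(job_id, i), Body=b"")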

Elastic and cost-effective services

The main purpose of elastic computing is to adapt quickly to workload demands driven by business needs. AWS Batch autoscaling makes thousands of CPUs available in a matter of minutes: the Batch compute environment automatically adjusts its capacity based on monitoring of the job queue. A variety of EC2 instances are readily available to accommodate batch jobs with the most demanding requirements in terms of memory and number of CPUs. This is a game changer for CGG and its customers. Geoscientists can now rent a supercomputer at a fraction of the cost of deploying an on-premises cluster and complete these workloads within a few hours, as opposed to running for days to weeks on limited on-premises infrastructure. A quantitative change in the availability of computing power enables a qualitative change in the results: more scenarios and more complex models can be simulated, improving the uncertainty estimation. The same argument applies to the elastic storage solution. As previously stated, the concurrent computations produce a substantial amount of file I/O that quickly becomes a bottleneck on a traditional network file system (NFS). The FSx for Lustre service delivers a fully managed parallel file system that can sustain millions of IOPS and hundreds of gigabytes per second of throughput[11].
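
For reference, the kind of managed compute environment this relies on can be declared in a few lines of Boto3; the subnet, security group, and role identifiers below are placeholders. Setting the minimum to zero vCPUs lets the environment scale down to nothing when the job queue is empty.

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="geostat-compute-env",
        type="MANAGED",
        computeResources={
            "type": "EC2",                    # or "SPOT" for further cost reduction
            "minvCpus": 0,                    # scale to zero when the queue is empty
            "maxvCpus": 2000,                 # illustrative ceiling for parallel realizations
            "instanceTypes": ["m5"],          # let Batch pick sizes within the M5 family
            "subnets": ["subnet-xxxxxxxx"],           # placeholder
            "securityGroupIds": ["sg-xxxxxxxx"],      # placeholder
            "instanceRole": "ecsInstanceRole",        # placeholder instance profile
        },
        serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
    )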

Interoperability of Cloud Services

Simply put, AWS services work well together. The interoperability of the services is seamless and optimized to avoid potential bottlenecks, which is essential for healthy and robust operation. A notable example is the integration of EC2, AWS Batch, and FSx for Lustre. The services manage a substantial amount of infrastructure, yet getting access to an FSx file system from a Docker container is straightforward and can be summarized in two steps: 1) install the Lustre client in the container and 2) mount the file system. The Lustre client can be pre-installed in the Docker image, and the mount command can be generated programmatically and added to the batch job script at run time. When implemented this way, the container image does not depend on a pre-existing FSx for Lustre file system. Decoupling the Docker images from the specifics of a particular computational job adds an extra layer of flexibility to the architecture.
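
The run-time generation of the mount command can be as simple as the following sketch; the DNS name and mount name would come from the FSx for Lustre file system created for the deployment, and the mount point is an arbitrary choice.

    def build_mount_command(dns_name: str, mount_name: str, mount_point: str = "/fsx") -> str:
        """Build the Lustre mount command that is appended to the batch job script."""
        return (
            f"mkdir -p {mount_point} && "
            f"mount -t lustre -o noatime,flock {dns_name}@tcp:/{mount_name} {mount_point}"
        )

    # Example: values would be read from the FSx API at deployment time.
    print(build_mount_command(
        dns_name="fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com",  # placeholder
        mount_name="abcdefgh",                                        # placeholder
    ))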

The data repository task is another example of interoperability that streamlines the design. The task manages data transfer between the Amazon FSx for Lustre file system and the durable repository on Amazon S3 (refer to Figure 2). We’ve observed backup rates close to 1 GB/s, which is sufficient for our needs. It is important to note that the transfer speed depends on the FSx file system size and scales linearly with it. The data repository task significantly simplifies and optimizes the storage component of the architecture, to the point where the FSx storage option becomes a deployment implementation detail that does not need to be exposed to the user.
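
A data repository task can also be triggered programmatically between stages; in the sketch below (file system ID and export path are placeholders), a stage’s output is exported from the file system to the linked S3 repository.

    import boto3

    fsx = boto3.client("fsx")

    def export_stage_results(file_system_id: str, stage_path: str) -> str:
        """Export a stage's output from FSx for Lustre to the linked S3 repository."""
        response = fsx.create_data_repository_task(
            FileSystemId=file_system_id,      # placeholder, e.g. "fs-0123456789abcdef0"
            Type="EXPORT_TO_REPOSITORY",
            Paths=[stage_path],               # e.g. "results/stage-2", relative to the mount root
            Report={"Enabled": False},        # no completion report for this sketch
        )
        return response["DataRepositoryTask"]["TaskId"]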

Technical Support by AWS Architects

Last but not least, CGG would like to thank a team of AWS architects for their support and guidance. On a special note, we would like to mention the effectiveness of the AWS Well-Architected review of our solution that they conducted with CGG. Both the review procedure and its outcome provided very valuable insight to the development team.

Conclusion

The AWS cloud is a good choice for deploying HPC workloads. It offers scalable, elastic services at a compelling price/performance ratio. CGG conducted numerous pre-production tests to assess levels of performance and throughput while focusing on the specifics of our problem domain. The tests confirmed that the cloud-based solution is capable of handling the most demanding geostatistical workflows.


[1] Markov chain Monte Carlo, https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo

[2] AWS Batch, https://aws.amazon.com/batch/

[3] Amazon FSx for Lustre, https://aws.amazon.com/fsx/lustre/

[4] Amazon S3, https://aws.amazon.com/s3/

[5] AWS SDK for Python (Boto3), https://aws.amazon.com/sdk-for-python/

[6] Amazon Elastic File System (EFS), https://aws.amazon.com/efs/

[7] Amazon States Language (ASL), AWS Step Functions Developer Guide

[8] AWS CloudFormation, https://aws.amazon.com/cloudformation/

[9] AWS Step Functions, https://aws.amazon.com/step-functions/

[10] Quotas related to state machine executions, AWS Step Functions Developer Guide

[11] Amazon FSx for Lustre performance, Amazon FSx for Lustre User Guide
