Kubernetes, Containers and HPC

By Daniel Gruber, Burak Yenier and Wolfgang Gentzsch, UberCloud

September 19, 2019

Software containers and Kubernetes are important tools for building, deploying, running and managing modern enterprise applications at scale and delivering enterprise software faster and more reliably to the end user — while using resources more efficiently and reducing costs.

Recently, high performance computing (HPC) is moving closer to the enterprise and can therefore benefit from an HPC container and Kubernetes ecosystem, with new requirements to quickly allocate and deallocate computational resources to HPC workloads. Compute capacity, be it for enterprise or HPC workloads, can no longer be planned years in advance.

Getting on-demand resources from a shared compute resource pool has never been easier as cloud service providers and software vendors are continuously investing and improving their services. In enterprise computing, packaging software in container images and running containers is meanwhile standard. Kubernetes has become the most widely used container and resource orchestrator in the Fortune 500 companies. The HPC community, led by efforts in the AI community, is picking up the concept and applying it to batch jobs and interactive applications.

Containers and HPC: Why Should We Care?

HPC leaders have a hard time. There are lots of changes and new ways of thinking in software technology and IT operations. Containerization have become ubiquitous; container orchestration with Kubernetes is the new standard. Deep learning workloads are continuously increasing their footprint, and Site Reliability Engineering (SRE) has been adopted on many sites. It is very hard for each of the new technologies to judge their usefulness for HPC type of workloads.

Introduction to Kubernetes

If your engineers or operators are running a single container on their laptop, they probably use Docker for doing that. But when having multiples of containers potentially on dozens or hundreds of machines, it becomes difficult to get them maintained. Kubernetes simplifies container orchestration by providing scheduling, container life-cycle management, networking functionalities and more in a scalable and extensible platform.

Major components of Kubernetes are the Kubernetes master which contains an API server, scheduler, and a controller manager. Controllers are a main concept: they watch out for the current state of resources and compare them with the expected state. If they differ, they take actions to move to an expected state. On the execution side we have the kubelet which is in contact with the master as well as a network proxy. The kubelet manages containers by using the container runtime interface (CRI) for interacting with runtimes like Docker, containerd, or CRI-O.

Running Kubernetes or HPC Schedulers?

Kubernetes is doing workload and resource management. Sounds familiar? Yes, in many ways it shares lots of functionalities with traditional HPC workload managers. The main differences are the workload types they focus on. While HPC workload managers are focused on running distributed memory jobs and support high-throughput scenarios, Kubernetes is primarily built for orchestrating containerized microservice applications.

HPC workload managers like Univa Grid Engine added a huge number of features in the last decades. Some notable functionalities are:

–  Support for shared and distributed memory (like MPI based) jobs

–  Advance reservations for allocating and blocking resources in advance

–  Fair-share to customize resource usage patterns across users, projects, and departments

–  Resource reservation for collecting resources for large jobs

–  Preemption for stopping low prior jobs in favor for running high prior jobs

–  NUMA aware scheduling for automatically allocating cores and sockets

–  Adhere to standards for job submission and management (like DRMAA and DRMAAv2)

HPC workload managers are tuned for speed, throughput, and scalability, being capable of running millions of batch jobs a day and supporting the infrastructure of the largest supercomputers in the world. What traditional HPC workload managers lack are means for supporting microservice architectures, deeply integrated container management capabilities, network management, and application life-cycle management. They are primarily built for running batch jobs in different scenarios like high-throughput, MPI jobs spanning across potentially hundreds or thousands of nodes, jobs running weeks, or jobs using special resource types (GPUs, FPGAs, licenses, etc.).

Kubernetes on the other hand is built for containerized microservice applications from the bottom- up. Some notable features are:

–  Management of sets of pods. Pods consist of one or more co-located containers.

–  Networking functionalities through a pluggable overlay network

–  Self-healing through controller concept comparing expected with current state

–  Declarative style configuration

–  Load balancing functionalities

–  Rolling updates of different versions of workloads

–  Integrations in many monitoring and logging solutions

–  Hooks to integrate external persistent storage in pods

–  Service discovery and routing

What Kubernetes lacks at this time is a proper high-throughput batch job queueing system with a sophisticated rule system for managing resource allocations. But one of the main drawbacks we see is that traditional HPC engineering applications are not yet built to interact with Kubernetes. But this will change in the future. New kinds of AI workloads on the other hand are supporting Kubernetes already very well – in fact many of these packages are targeted to Kubernetes.

Can HPC Workload Be Managed by Kubernetes?

We should combine both Kubernetes and HPC workload-management systems to fully meet the HPC requirements. Kubernetes will be used for managing HPC containers along with all the required services. Inside the containers, not just the engineering application can be run, but also the capability to either plug into an existing HPC cluster or run an entire HPC resource manager installation (like SLURM or Univa Grid Engine) needs to be provided. In that way, we can provide compatibility to the engineering applications and can exploit the extended batch scheduling capabilities. At the same time our whole deployment can be operated in all Kubernetes enabled environments with the advantages of standardized container orchestration.

Ease of Administration

HPC environments consist of a potentially large set of containers. The higher abstraction of container orchestration compared to self-managing container single runtime engines provides the necessary flexibility we need to fulfill different customer requirements. Management operations like scaling the deployment are much simpler to implement and execute.

The Run-Time for Hybrid and Multi-Cloud Offers True Portability

Portability is a key value of containers. Kubernetes provides us this portability for fleets of containers. We can have the same experience on-premises as well as on different cloud infrastructures. Engineers can seamlessly switch the infrastructure without any changes for the engineers. In that way we can choose the infrastructure by criteria like price, performance, and capabilities. When running on premises we can start offering true hybrid-cloud experience by providing a consistent infrastructure with the same operational and HPC application experience and seamlessly use on-demand cloud resources when required.

What’s Next?

Embracing Kubernetes for the specific requirements of HPC and engineering workload is not straight forward. But due to the success of Kubernetes and its open and extensible architecture the ecosystem is opening up for HPC applications primarily driven by the demand of new AI workloads and HPC containers.

About the Authors

Daniel Gruber, Burak Yenier, and Wolfgang Gentzsch are with UberCloud, a company that started in 2013 with developing HPC container technology and containerized engineering applications, to facilitate access and use of engineering HPC workload in a shared on-premise or on-demand cloud environment. This article is based on a white paper they wrote detailing their experience using UberCloud HPC containers and Kubernetes.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Watch Nvidia’s GTC21 Keynote with Jensen Huang Livestreamed Here, Monday at 8:30am PT

April 9, 2021

Join HPCwire right here on Monday, April 12, at 8:30 am PT to see the Nvidia GTC21 keynote from Nvidia’s CEO, Jensen Huang, livestreamed in its entirety. Hosted by HPCwire, you can click to join the Huang keynote on our livestream to hear Nvidia’s expected news and... Read more…

The US Places Seven Additional Chinese Supercomputing Entities on Blacklist

April 8, 2021

As tensions between the U.S. and China continue to simmer, the U.S. government today added seven Chinese supercomputing entities to an economic blacklist. The U.S. Entity List bars U.S. firms from supplying key technolog Read more…

Argonne Supercomputing Supports Caterpillar Engine Design

April 8, 2021

Diesel fuels still account for nearly ten percent of all energy-related U.S. carbon emissions – most of them from heavy-duty vehicles like trucks and construction equipment. Energy efficiency is key to these machines, Read more…

Habana’s AI Silicon Comes to San Diego Supercomputer Center

April 8, 2021

Habana Labs, an Intel-owned AI company, has partnered with server maker Supermicro to provide high-performance, high-efficiency AI computing in the form of new training and inference servers that will power the upcoming Read more…

Intel Partners Debut Latest Servers Based on the New Intel Gen 3 ‘Ice Lake’ Xeons

April 7, 2021

Fresh from Intel’s launch of the company’s latest third-generation Xeon Scalable “Ice Lake” processors on April 6 (Tuesday), Intel server partners Cisco, Dell EMC, HPE and Lenovo simultaneously unveiled their first server models built around the latest chips. And though arch-rival AMD may... Read more…

AWS Solution Channel

Volkswagen Passenger Cars Uses NICE DCV for High-Performance 3D Remote Visualization

 

Volkswagen Passenger Cars has been one of the world’s largest car manufacturers for over 70 years. The company delivers more than 6 million automobiles to global customers every year, from 50 production locations on five continents. Read more…

What’s New in HPC Research: Tundra, Fugaku, µHPC & More

April 6, 2021

In this regular feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

The US Places Seven Additional Chinese Supercomputing Entities on Blacklist

April 8, 2021

As tensions between the U.S. and China continue to simmer, the U.S. government today added seven Chinese supercomputing entities to an economic blacklist. The U Read more…

Habana’s AI Silicon Comes to San Diego Supercomputer Center

April 8, 2021

Habana Labs, an Intel-owned AI company, has partnered with server maker Supermicro to provide high-performance, high-efficiency AI computing in the form of new Read more…

Intel Partners Debut Latest Servers Based on the New Intel Gen 3 ‘Ice Lake’ Xeons

April 7, 2021

Fresh from Intel’s launch of the company’s latest third-generation Xeon Scalable “Ice Lake” processors on April 6 (Tuesday), Intel server partners Cisco, Dell EMC, HPE and Lenovo simultaneously unveiled their first server models built around the latest chips. And though arch-rival AMD may... Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

HPE Launches Storage Line Loaded with IBM’s Spectrum Scale File System

April 6, 2021

HPE today launched a new family of storage solutions bundled with IBM’s Spectrum Scale Erasure Code Edition parallel file system (description below) and featu Read more…

RIKEN’s Ongoing COVID Research Includes New Vaccines, New Tests & More

April 6, 2021

RIKEN took the supercomputing world by storm last summer when it launched Fugaku – which became (and remains) the world’s most powerful supercomputer – ne Read more…

CERN Is Betting Big on Exascale

April 1, 2021

The European Organization for Nuclear Research (CERN) involves 23 countries, 15,000 researchers, billions of dollars a year, and the biggest machine in the worl Read more…

AI Systems Summit Keynote: Brace for System Level Heterogeneity Says de Supinski

April 1, 2021

Heterogeneous computing has quickly come to mean packing a couple of CPUs and one-or-many accelerators, mostly GPUs, onto the same node. Today, a one-such-node system has become the standard AI server offered by dozens of vendors. This is not to diminish the many advances... Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

CERN Is Betting Big on Exascale

April 1, 2021

The European Organization for Nuclear Research (CERN) involves 23 countries, 15,000 researchers, billions of dollars a year, and the biggest machine in the worl Read more…

Programming the Soon-to-Be World’s Fastest Supercomputer, Frontier

January 5, 2021

What’s it like designing an app for the world’s fastest supercomputer, set to come online in the United States in 2021? The University of Delaware’s Sunita Chandrasekaran is leading an elite international team in just that task. Chandrasekaran, assistant professor of computer and information sciences, recently was named... Read more…

HPE Launches Storage Line Loaded with IBM’s Spectrum Scale File System

April 6, 2021

HPE today launched a new family of storage solutions bundled with IBM’s Spectrum Scale Erasure Code Edition parallel file system (description below) and featu Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

Saudi Aramco Unveils Dammam 7, Its New Top Ten Supercomputer

January 21, 2021

By revenue, oil and gas giant Saudi Aramco is one of the largest companies in the world, and it has historically employed commensurate amounts of supercomputing Read more…

Quantum Computer Start-up IonQ Plans IPO via SPAC

March 8, 2021

IonQ, a Maryland-based quantum computing start-up working with ion trap technology, plans to go public via a Special Purpose Acquisition Company (SPAC) merger a Read more…

Leading Solution Providers

Contributors

Can Deep Learning Replace Numerical Weather Prediction?

March 3, 2021

Numerical weather prediction (NWP) is a mainstay of supercomputing. Some of the first applications of the first supercomputers dealt with climate modeling, and Read more…

Livermore’s El Capitan Supercomputer to Debut HPE ‘Rabbit’ Near Node Local Storage

February 18, 2021

A near node local storage innovation called Rabbit factored heavily into Lawrence Livermore National Laboratory’s decision to select Cray’s proposal for its CORAL-2 machine, the lab’s first exascale-class supercomputer, El Capitan. Details of this new storage technology were revealed... Read more…

New Deep Learning Algorithm Solves Rubik’s Cube

July 25, 2018

Solving (and attempting to solve) Rubik’s Cube has delighted millions of puzzle lovers since 1974 when the cube was invented by Hungarian sculptor and archite Read more…

African Supercomputing Center Inaugurates ‘Toubkal,’ Most Powerful Supercomputer on the Continent

February 25, 2021

Historically, Africa hasn’t exactly been synonymous with supercomputing. There are only a handful of supercomputers on the continent, with few ranking on the Read more…

The History of Supercomputing vs. COVID-19

March 9, 2021

The COVID-19 pandemic poses a greater challenge to the high-performance computing community than any before. HPCwire's coverage of the supercomputing response t Read more…

HPE Names Justin Hotard New HPC Chief as Pete Ungaro Departs

March 2, 2021

HPE CEO Antonio Neri announced today (March 2, 2021) the appointment of Justin Hotard as general manager of HPC, mission critical solutions and labs, effective Read more…

AMD Launches Epyc ‘Milan’ with 19 SKUs for HPC, Enterprise and Hyperscale

March 15, 2021

At a virtual launch event held today (Monday), AMD revealed its third-generation Epyc “Milan” CPU lineup: a set of 19 SKUs -- including the flagship 64-core, 280-watt 7763 part --  aimed at HPC, enterprise and cloud workloads. Notably, the third-gen Epyc Milan chips achieve 19 percent... Read more…

Microsoft, HPE Bringing AI, Edge, Cloud to Earth Orbit in Preparation for Mars Missions

February 12, 2021

The International Space Station will soon get a delivery of powerful AI, edge and cloud computing tools from HPE and Microsoft Azure to expand technology experi Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire