Kubernetes, Containers and HPC

By Daniel Gruber, Burak Yenier and Wolfgang Gentzsch, UberCloud

September 19, 2019

Software containers and Kubernetes are important tools for building, deploying, running and managing modern enterprise applications at scale and delivering enterprise software faster and more reliably to the end user — while using resources more efficiently and reducing costs.

Recently, high performance computing (HPC) is moving closer to the enterprise and can therefore benefit from an HPC container and Kubernetes ecosystem, with new requirements to quickly allocate and deallocate computational resources to HPC workloads. Compute capacity, be it for enterprise or HPC workloads, can no longer be planned years in advance.

Getting on-demand resources from a shared compute resource pool has never been easier as cloud service providers and software vendors are continuously investing and improving their services. In enterprise computing, packaging software in container images and running containers is meanwhile standard. Kubernetes has become the most widely used container and resource orchestrator in the Fortune 500 companies. The HPC community, led by efforts in the AI community, is picking up the concept and applying it to batch jobs and interactive applications.

Containers and HPC: Why Should We Care?

HPC leaders have a hard time. There are lots of changes and new ways of thinking in software technology and IT operations. Containerization have become ubiquitous; container orchestration with Kubernetes is the new standard. Deep learning workloads are continuously increasing their footprint, and Site Reliability Engineering (SRE) has been adopted on many sites. It is very hard for each of the new technologies to judge their usefulness for HPC type of workloads.

Introduction to Kubernetes

If your engineers or operators are running a single container on their laptop, they probably use Docker for doing that. But when having multiples of containers potentially on dozens or hundreds of machines, it becomes difficult to get them maintained. Kubernetes simplifies container orchestration by providing scheduling, container life-cycle management, networking functionalities and more in a scalable and extensible platform.

Major components of Kubernetes are the Kubernetes master which contains an API server, scheduler, and a controller manager. Controllers are a main concept: they watch out for the current state of resources and compare them with the expected state. If they differ, they take actions to move to an expected state. On the execution side we have the kubelet which is in contact with the master as well as a network proxy. The kubelet manages containers by using the container runtime interface (CRI) for interacting with runtimes like Docker, containerd, or CRI-O.

Running Kubernetes or HPC Schedulers?

Kubernetes is doing workload and resource management. Sounds familiar? Yes, in many ways it shares lots of functionalities with traditional HPC workload managers. The main differences are the workload types they focus on. While HPC workload managers are focused on running distributed memory jobs and support high-throughput scenarios, Kubernetes is primarily built for orchestrating containerized microservice applications.

HPC workload managers like Univa Grid Engine added a huge number of features in the last decades. Some notable functionalities are:

–  Support for shared and distributed memory (like MPI based) jobs

–  Advance reservations for allocating and blocking resources in advance

–  Fair-share to customize resource usage patterns across users, projects, and departments

–  Resource reservation for collecting resources for large jobs

–  Preemption for stopping low prior jobs in favor for running high prior jobs

–  NUMA aware scheduling for automatically allocating cores and sockets

–  Adhere to standards for job submission and management (like DRMAA and DRMAAv2)

HPC workload managers are tuned for speed, throughput, and scalability, being capable of running millions of batch jobs a day and supporting the infrastructure of the largest supercomputers in the world. What traditional HPC workload managers lack are means for supporting microservice architectures, deeply integrated container management capabilities, network management, and application life-cycle management. They are primarily built for running batch jobs in different scenarios like high-throughput, MPI jobs spanning across potentially hundreds or thousands of nodes, jobs running weeks, or jobs using special resource types (GPUs, FPGAs, licenses, etc.).

Kubernetes on the other hand is built for containerized microservice applications from the bottom- up. Some notable features are:

–  Management of sets of pods. Pods consist of one or more co-located containers.

–  Networking functionalities through a pluggable overlay network

–  Self-healing through controller concept comparing expected with current state

–  Declarative style configuration

–  Load balancing functionalities

–  Rolling updates of different versions of workloads

–  Integrations in many monitoring and logging solutions

–  Hooks to integrate external persistent storage in pods

–  Service discovery and routing

What Kubernetes lacks at this time is a proper high-throughput batch job queueing system with a sophisticated rule system for managing resource allocations. But one of the main drawbacks we see is that traditional HPC engineering applications are not yet built to interact with Kubernetes. But this will change in the future. New kinds of AI workloads on the other hand are supporting Kubernetes already very well – in fact many of these packages are targeted to Kubernetes.

Can HPC Workload Be Managed by Kubernetes?

We should combine both Kubernetes and HPC workload-management systems to fully meet the HPC requirements. Kubernetes will be used for managing HPC containers along with all the required services. Inside the containers, not just the engineering application can be run, but also the capability to either plug into an existing HPC cluster or run an entire HPC resource manager installation (like SLURM or Univa Grid Engine) needs to be provided. In that way, we can provide compatibility to the engineering applications and can exploit the extended batch scheduling capabilities. At the same time our whole deployment can be operated in all Kubernetes enabled environments with the advantages of standardized container orchestration.

Ease of Administration

HPC environments consist of a potentially large set of containers. The higher abstraction of container orchestration compared to self-managing container single runtime engines provides the necessary flexibility we need to fulfill different customer requirements. Management operations like scaling the deployment are much simpler to implement and execute.

The Run-Time for Hybrid and Multi-Cloud Offers True Portability

Portability is a key value of containers. Kubernetes provides us this portability for fleets of containers. We can have the same experience on-premises as well as on different cloud infrastructures. Engineers can seamlessly switch the infrastructure without any changes for the engineers. In that way we can choose the infrastructure by criteria like price, performance, and capabilities. When running on premises we can start offering true hybrid-cloud experience by providing a consistent infrastructure with the same operational and HPC application experience and seamlessly use on-demand cloud resources when required.

What’s Next?

Embracing Kubernetes for the specific requirements of HPC and engineering workload is not straight forward. But due to the success of Kubernetes and its open and extensible architecture the ecosystem is opening up for HPC applications primarily driven by the demand of new AI workloads and HPC containers.

About the Authors

Daniel Gruber, Burak Yenier, and Wolfgang Gentzsch are with UberCloud, a company that started in 2013 with developing HPC container technology and containerized engineering applications, to facilitate access and use of engineering HPC workload in a shared on-premise or on-demand cloud environment. This article is based on a white paper they wrote detailing their experience using UberCloud HPC containers and Kubernetes.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industry updates delivered to you every week!

MLPerf Inference 4.0 Results Showcase GenAI; Nvidia Still Dominates

March 28, 2024

There were no startling surprises in the latest MLPerf Inference benchmark (4.0) results released yesterday. Two new workloads — Llama 2 and Stable Diffusion XL — were added to the benchmark suite as MLPerf continues Read more…

Q&A with Nvidia’s Chief of DGX Systems on the DGX-GB200 Rack-scale System

March 27, 2024

Pictures of Nvidia's new flagship mega-server, the DGX GB200, on the GTC show floor got favorable reactions on social media for the sheer amount of computing power it brings to artificial intelligence.  Nvidia's DGX Read more…

Call for Participation in Workshop on Potential NSF CISE Quantum Initiative

March 26, 2024

Editor’s Note: Next month there will be a workshop to discuss what a quantum initiative led by NSF’s Computer, Information Science and Engineering (CISE) directorate could entail. The details are posted below in a Ca Read more…

Waseda U. Researchers Reports New Quantum Algorithm for Speeding Optimization

March 25, 2024

Optimization problems cover a wide range of applications and are often cited as good candidates for quantum computing. However, the execution time for constrained combinatorial optimization applications on quantum device Read more…

NVLink: Faster Interconnects and Switches to Help Relieve Data Bottlenecks

March 25, 2024

Nvidia’s new Blackwell architecture may have stolen the show this week at the GPU Technology Conference in San Jose, California. But an emerging bottleneck at the network layer threatens to make bigger and brawnier pro Read more…

Who is David Blackwell?

March 22, 2024

During GTC24, co-founder and president of NVIDIA Jensen Huang unveiled the Blackwell GPU. This GPU itself is heavily optimized for AI work, boasting 192GB of HBM3E memory as well as the the ability to train 1 trillion pa Read more…

MLPerf Inference 4.0 Results Showcase GenAI; Nvidia Still Dominates

March 28, 2024

There were no startling surprises in the latest MLPerf Inference benchmark (4.0) results released yesterday. Two new workloads — Llama 2 and Stable Diffusion Read more…

Q&A with Nvidia’s Chief of DGX Systems on the DGX-GB200 Rack-scale System

March 27, 2024

Pictures of Nvidia's new flagship mega-server, the DGX GB200, on the GTC show floor got favorable reactions on social media for the sheer amount of computing po Read more…

NVLink: Faster Interconnects and Switches to Help Relieve Data Bottlenecks

March 25, 2024

Nvidia’s new Blackwell architecture may have stolen the show this week at the GPU Technology Conference in San Jose, California. But an emerging bottleneck at Read more…

Who is David Blackwell?

March 22, 2024

During GTC24, co-founder and president of NVIDIA Jensen Huang unveiled the Blackwell GPU. This GPU itself is heavily optimized for AI work, boasting 192GB of HB Read more…

Nvidia Looks to Accelerate GenAI Adoption with NIM

March 19, 2024

Today at the GPU Technology Conference, Nvidia launched a new offering aimed at helping customers quickly deploy their generative AI applications in a secure, s Read more…

The Generative AI Future Is Now, Nvidia’s Huang Says

March 19, 2024

We are in the early days of a transformative shift in how business gets done thanks to the advent of generative AI, according to Nvidia CEO and cofounder Jensen Read more…

Nvidia’s New Blackwell GPU Can Train AI Models with Trillions of Parameters

March 18, 2024

Nvidia's latest and fastest GPU, codenamed Blackwell, is here and will underpin the company's AI plans this year. The chip offers performance improvements from Read more…

Nvidia Showcases Quantum Cloud, Expanding Quantum Portfolio at GTC24

March 18, 2024

Nvidia’s barrage of quantum news at GTC24 this week includes new products, signature collaborations, and a new Nvidia Quantum Cloud for quantum developers. Wh Read more…

Alibaba Shuts Down its Quantum Computing Effort

November 30, 2023

In case you missed it, China’s e-commerce giant Alibaba has shut down its quantum computing research effort. It’s not entirely clear what drove the change. Read more…

Nvidia H100: Are 550,000 GPUs Enough for This Year?

August 17, 2023

The GPU Squeeze continues to place a premium on Nvidia H100 GPUs. In a recent Financial Times article, Nvidia reports that it expects to ship 550,000 of its lat Read more…

Shutterstock 1285747942

AMD’s Horsepower-packed MI300X GPU Beats Nvidia’s Upcoming H200

December 7, 2023

AMD and Nvidia are locked in an AI performance battle – much like the gaming GPU performance clash the companies have waged for decades. AMD has claimed it Read more…

DoD Takes a Long View of Quantum Computing

December 19, 2023

Given the large sums tied to expensive weapon systems – think $100-million-plus per F-35 fighter – it’s easy to forget the U.S. Department of Defense is a Read more…

Synopsys Eats Ansys: Does HPC Get Indigestion?

February 8, 2024

Recently, it was announced that Synopsys is buying HPC tool developer Ansys. Started in Pittsburgh, Pa., in 1970 as Swanson Analysis Systems, Inc. (SASI) by John Swanson (and eventually renamed), Ansys serves the CAE (Computer Aided Engineering)/multiphysics engineering simulation market. Read more…

Choosing the Right GPU for LLM Inference and Training

December 11, 2023

Accelerating the training and inference processes of deep learning models is crucial for unleashing their true potential and NVIDIA GPUs have emerged as a game- Read more…

Intel’s Server and PC Chip Development Will Blur After 2025

January 15, 2024

Intel's dealing with much more than chip rivals breathing down its neck; it is simultaneously integrating a bevy of new technologies such as chiplets, artificia Read more…

Baidu Exits Quantum, Closely Following Alibaba’s Earlier Move

January 5, 2024

Reuters reported this week that Baidu, China’s giant e-commerce and services provider, is exiting the quantum computing development arena. Reuters reported � Read more…

Leading Solution Providers

Contributors

Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?

October 30, 2023

With long lead times for the NVIDIA H100 and A100 GPUs, many organizations are looking at the new NVIDIA L40S GPU, which it’s a new GPU optimized for AI and g Read more…

Shutterstock 1179408610

Google Addresses the Mysteries of Its Hypercomputer 

December 28, 2023

When Google launched its Hypercomputer earlier this month (December 2023), the first reaction was, "Say what?" It turns out that the Hypercomputer is Google's t Read more…

AMD MI3000A

How AMD May Get Across the CUDA Moat

October 5, 2023

When discussing GenAI, the term "GPU" almost always enters the conversation and the topic often moves toward performance and access. Interestingly, the word "GPU" is assumed to mean "Nvidia" products. (As an aside, the popular Nvidia hardware used in GenAI are not technically... Read more…

Shutterstock 1606064203

Meta’s Zuckerberg Puts Its AI Future in the Hands of 600,000 GPUs

January 25, 2024

In under two minutes, Meta's CEO, Mark Zuckerberg, laid out the company's AI plans, which included a plan to build an artificial intelligence system with the eq Read more…

Google Introduces ‘Hypercomputer’ to Its AI Infrastructure

December 11, 2023

Google ran out of monikers to describe its new AI system released on December 7. Supercomputer perhaps wasn't an apt description, so it settled on Hypercomputer Read more…

China Is All In on a RISC-V Future

January 8, 2024

The state of RISC-V in China was discussed in a recent report released by the Jamestown Foundation, a Washington, D.C.-based think tank. The report, entitled "E Read more…

Intel Won’t Have a Xeon Max Chip with New Emerald Rapids CPU

December 14, 2023

As expected, Intel officially announced its 5th generation Xeon server chips codenamed Emerald Rapids at an event in New York City, where the focus was really o Read more…

IBM Quantum Summit: Two New QPUs, Upgraded Qiskit, 10-year Roadmap and More

December 4, 2023

IBM kicks off its annual Quantum Summit today and will announce a broad range of advances including its much-anticipated 1121-qubit Condor QPU, a smaller 133-qu Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire