The general consensus that AI would “change everything” is proving correct when it comes to infrastructure impacts. The new imperative is to raise and maintain operational standards to avoid the outages that come with layering AI onto existing infrastructure. Key elements of that infrastructure are already straining to keep pace, and it is hard to predict which new AI workload will be the one that breaks them.
That is especially true when so many AI projects are being added at the same time across multi-tenant data centers. With so much at stake and so little under your control, what can be done to ensure your AI workloads sail through at scale?
The risk is high when HPC clusters are deployed too quickly in multi-tenant data centers without network-level tenancy that aligns with existing systems. The result can be delays, increased costs, new vulnerabilities, and even service disruptions. It’s crucial to audit, anticipate, and monitor these issues so you can plan and act accordingly.
Improving Kubernetes Ingress
Kubernetes is already embedded in HPC AI cluster operations. By design, Kubernetes networking allows containerized processes to communicate directly within the cluster, and a flat network structure is maintained across each cluster to make that possible. But when services need to interact with external applications, or with other segregated processes within the cluster, a Kubernetes Service resource must handle ingress communication. Additional infrastructure components, called Kubernetes Ingress Controllers, are required to manage that ingress traffic; they are not standard network components and must be deployed and operated alongside the cluster.
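To make the ingress path concrete, the sketch below uses the official Kubernetes Python client to publish an in-cluster service through an Ingress resource. The namespace, hostname, service name, and ingress class are hypothetical placeholders; the point is simply that an Ingress Controller must exist to act on the resource.

```python
# Minimal sketch: publishing an in-cluster inference service through an
# Ingress resource with the official Kubernetes Python client.
# Namespace, host, service name, and ingress class are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster

ingress = client.V1Ingress(
    api_version="networking.k8s.io/v1",
    kind="Ingress",
    metadata=client.V1ObjectMeta(name="inference-api", namespace="tenant-a"),
    spec=client.V1IngressSpec(
        ingress_class_name="f5",  # selects which Ingress Controller handles this resource
        rules=[
            client.V1IngressRule(
                host="inference.example.com",
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/v1/predict",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="inference-svc",
                                    port=client.V1ServiceBackendPort(number=8080),
                                )
                            ),
                        )
                    ]
                ),
            )
        ],
    ),
)

# The Ingress resource is inert until an Ingress Controller watches and implements it.
client.NetworkingV1Api().create_namespaced_ingress(namespace="tenant-a", body=ingress)
```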
F5, known for decades of network load balancing, most commonly through its BIG-IP product suite, has expanded to offer higher-volume software and hardware designed to adapt networking infrastructure to accelerating AI growth. In short, F5 BIG-IP offers a Kubernetes Ingress Controller that provides secured ingress for AI and HPC clustered applications. Adding to its appeal, BIG-IP is a well-understood, trusted, and accepted component in enterprise data centers for both NetOps and SecOps teams.
Building for Effective Multi-tenancy
Kubernetes nodes use NodeIPs to manage inter-host routing within the cluster, which at first looks like a solid, distributed network design. And it mostly is, as long as the cluster is dedicated to a single tenant. However, traffic from different security tenants within the cluster leaves the node sourced from the same NodeIP, making it difficult for traditional monitoring and security tools to differentiate between tenants. This lack of visibility complicates network security, particularly in HPC AI clusters.
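A minimal sketch of the problem, with hypothetical addresses: once pod traffic is source-NATed to the shared NodeIP, upstream tools see a single address no matter which tenant generated the flow.

```python
# Illustrative sketch only (hypothetical addresses): why per-tenant visibility
# is lost when pod traffic is SNATed to the shared NodeIP on egress.
pods = {
    "10.42.1.17": "tenant-a",   # pod in namespace tenant-a
    "10.42.1.88": "tenant-b",   # pod in namespace tenant-b
}
NODE_IP = "192.168.50.10"       # the single NodeIP both pods share

def snat(pod_ip: str) -> str:
    """Model the source NAT a node performs for traffic leaving the cluster."""
    return NODE_IP

# What an external firewall or monitoring tool actually observes:
for pod_ip, tenant in pods.items():
    print(f"{tenant}: pod {pod_ip} -> observed source {snat(pod_ip)}")
# Both tenants appear as 192.168.50.10, so upstream tools cannot attribute
# a flow to a specific tenant without tenant-aware egress control.
```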
Wide adoption of 5G by global network providers (telecoms) extended the scope of the problem, because deploying 5G also meant adopting multi-tenant Kubernetes clusters.
This need drove the evolution of F5 BIG-IP Next, a modular version of the BIG-IP network stack, and it is where BIG-IP Next Service Proxy for Kubernetes (SPK) was born. Beyond ingress, SPK also manages egress, which is now equally critical functionality. BIG-IP Next SPK lives both inside the Kubernetes cluster and inside the data center network fabric. SPK provides a distributed implementation of BIG-IP, controlled as a Kubernetes resource, that understands both Kubernetes namespace-based tenancy and the network-segregation tenancy required by the data center networking fabric. This gives Kubernetes clusters a central point of control for ingress and egress networking and security, improving visibility and efficiency and reducing TCO.
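The core idea can be sketched as a mapping from namespace tenancy to data center network tenancy. The Python below is a conceptual illustration only, not SPK's actual API or CRDs; the namespaces, VLAN IDs, and SNAT pools are hypothetical.

```python
# Conceptual sketch only (not SPK's real configuration model): map Kubernetes
# namespace tenancy onto data-center network tenancy so each tenant's ingress
# and egress traffic stays in its own segment and remains attributable.
# Namespace names, VLAN IDs, and SNAT pools below are hypothetical.
from dataclasses import dataclass

@dataclass
class TenantNetwork:
    namespace: str          # Kubernetes-side tenancy
    vlan_id: int            # data-center-fabric-side tenancy
    egress_snat_pool: str   # per-tenant source addresses, restoring visibility

TENANCY_MAP = {
    "tenant-a": TenantNetwork("tenant-a", vlan_id=101, egress_snat_pool="203.0.113.0/28"),
    "tenant-b": TenantNetwork("tenant-b", vlan_id=102, egress_snat_pool="203.0.113.16/28"),
}

def route_egress(namespace: str) -> TenantNetwork:
    """Pick the network segment and SNAT pool for traffic leaving this namespace."""
    return TENANCY_MAP[namespace]

print(route_egress("tenant-b"))
# TenantNetwork(namespace='tenant-b', vlan_id=102, egress_snat_pool='203.0.113.16/28')
```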
Optimizing GPU Performance with CPU Offload
A third major area of concern is optimizing the infrastructure for GPU scale. HPC AI clusters built around GPUs have inter-service (East-West) networking requirements that rival the mobile traffic bandwidth of entire continents. None of these issues are new to the HPC community, but they will be news to most enterprise and network service operator teams.
Existing data centers are designed around CPUs for serial processing, so adding GPUs changes the dynamics of data traffic: the network is not automatically optimized to make the best use of GPU parallel processing. To connect these highly engineered HPC AI clusters, a new generation of network interface hardware, the SuperNIC DPU/IPU (data processing unit / infrastructure processing unit), is being integrated into Kubernetes cluster nodes. Offloading network processing from the CPU to these DPU/IPUs offers several benefits: it frees host compute, accelerates data movement, and keeps GPU capacity better utilized.
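As a rough back-of-envelope illustration of the first benefit, freeing host compute, the figures below are assumptions rather than measured benchmarks; the calculation simply shows how quickly software packet processing can consume host cores at AI-cluster line rates.

```python
# Back-of-envelope sketch with assumed (hypothetical) figures, not benchmarks:
# estimate how many host CPU cores software packet processing consumes at a
# given line rate, and what offloading that work to a DPU/IPU gives back.
LINE_RATE_GBPS = 200     # assumed per-node East-West bandwidth
CYCLES_PER_BYTE = 3.0    # assumed host CPU cost of packet processing
CORE_CLOCK_GHZ = 2.5     # assumed host core clock
HOST_CORES = 64          # assumed cores per node

bytes_per_sec = LINE_RATE_GBPS * 1e9 / 8
cycles_needed = bytes_per_sec * CYCLES_PER_BYTE
cores_consumed = cycles_needed / (CORE_CLOCK_GHZ * 1e9)

print(f"Cores spent on networking without offload: {cores_consumed:.1f}")
print(f"Share of the node reclaimed by DPU offload: {cores_consumed / HOST_CORES:.0%}")
```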
In this offload scenario with SPK deployed, the AI cluster DPUs carry not only the new versions of F5’s data plane but the full BIG-IP network stack. That opens the door to simplified AI service middleware deployments that use BIG-IP for both the reverse-proxy ingress services and the forward-proxy egress services.
Conclusion
Fortifying infrastructure and bridging gaps to deliver the agility, resilience, management, and security that AI workloads demand, at the scale and pace at which AI clusters are being deployed, is no easy undertaking. But it must be done to prevent the outages, lag, and breakage that can significantly slow or imperil AI application rollouts and adoption.
For a deeper technical dive, please read our free technical article on DevCentral, “Preparing Network Infrastructures for HPC Clusters for AI.”