Trish Damkroger (VP, Intel Data Center) recently said “High-performance computing is a strategic capability to accelerate scientific discovery and industrial innovation, further driving our economic competitiveness, technology leadership, and national security.”
Intel® Select Solutions for High Performance Computing (HPC) offer easy, quick-to-deploy infrastructure that removes the complexity of advanced computing and helps accelerate the time to actionable insights for users in industry and science. Our portfolio includes workload-optimized configurations for HPC & AI Converged Clusters. These solutions share a common base architecture and comply with the Intel HPC Platform Specification, which gives each solution the advantage of validated compatibility with a wide range of HPC workloads, including those listed in the Intel HPC Application Catalog.
Intel Select Solutions for Simulation & Modeling serve as a common foundation for the family of solutions and are designed for productivity, compatibility, and workload-optimized performance across a broad range of traditional HPC applications. Intel Select Solutions for HPC & AI Converged Clusters extend the Simulation & Modeling solution to let users run a wide range of analytics and AI applications on common infrastructure. This maximizes flexibility, improves utilization, and supports the trend toward converged workflows that combine simulation, modeling, analytics, and AI workloads for accelerated discovery and insight.
Modern clusters are no longer just for HPC: they must support a mix of AI, traditional HPC modeling and simulation, and High Performance Data Analytics (HPDA) workloads. Similarly, they must run traditional HPC batch jobs as well as support local, private-cloud jobs and bursting to the public cloud.
Focusing on individual clusters, Intel® Select Solutions offers pre-validated workload-optimized servers for HPC-AI-HPDA workloads.
To enable customers to take the next step, Intel has put significant effort into developing a set of solutions that support pooling individual workload-optimized clusters into a uniform computing environment. Such an environment can run cloud, HPC, AI, and HPDA workloads 24/7 on all the hardware – even when the clusters sit across a WAN at geographically distant locations, and even when some users want to run accelerators or run applications in a cloud-based environment.
Pooled workload-optimized clusters support disparate computational needs
When confronted with new AI and HPDA workloads, many organizations add AI capabilities and clusters piecemeal. This leads to a patchwork of disparate systems, wasted resources, and a tangled software ecosystem.
Look to your servers first
Workload-optimized servers are the heart of every cluster. As a datacenter leader, Intel recommends a variety of pre-validated Intel Select Solutions that efficiently run HPC-AI-HPDA workloads. These solutions can be purchased from a number of vendors.
Maximize work with pooled clusters
Pooling resources just makes sense.
For example, most organizations don’t run their deep learning infrastructure on a 24×7 basis. The part-time nature of these workloads means that the special-purpose infrastructure often sits idle and may require specialized skills to support, both of which are costly to the business.
Instead, Intel advocates pooling these clusters together into a unified cluster architecture as shown in the graphic below.
Unified clusters maximize the value of existing resources because the resource manager, not humans, works 24/7 to keep the hardware busy.
Bringing cloud and AI innovation to your data center
Build your clusters to be your supercomputer “secret weapon” while leveraging the tremendous amount of work and innovation that is being put into new hardware and an amazing ecosystem of industry-standard software tools and libraries.
AI and HPDA
In short, a rapidly maturing software ecosystem of industry-standard AI and data analytics tools, coupled with remarkable growth in electronically analyzable data, now lets users work with data in ways that have transformed the computer industry and created a new era in HPC. In practice, HPC should now be considered HPC-AI-HPDA.
The following figure illustrates the breadth of AI and HPDA adoption across a number of HPC-oriented industries and market segments. Clearly, AI will remain a workload in the HPC datacenter.
Cloud as a computing resource and bellwether to the future
Similarly, cloud computing is acting as a massive source of resources and innovation that gives everyone access to a supercomputer “secret weapon” for their AI and HPC needs.
Not only does the cloud give a vast audience of small and medium businesses (SMBs) and small research teams access to software tools that can run at supercomputer scale with supercomputer-class performance, it also acts as a bellwether of new technology trends.
Dan Stanzione (Executive Director at TACC) succinctly summarizes this by stating, “Giving users access to the cloud means they can experiment with the latest architectures as cloud providers are deploying those all the time.” [i]
Meeting disparate needs with pooled workload-optimized clusters
To gain the advantages of pooled clusters in a local environment, Intel suggests the following approaches, which are discussed in greater detail in “The Intel® Select Solutions for HPC & AI Converged Clusters Solution Brief”.
Integrating pooled clusters and the private cloud into HPC batch schedulers
To support a pooled environment using existing HPC batch schedulers, Intel has created solutions for popular batch schedulers that help submit AI and analytics jobs so they run efficiently. The abstraction these solutions offer dramatically simplifies implementation for customers.
In short, use Univa Grid Engine* or the open-source Magpie project for SLURM to add support for cloud and AI jobs in a batch HPC environment. Simulation and modeling workloads continue to operate as usual, creating a unified environment from the standpoint of resource management, while users can burst to the cloud to conserve on-premises resources or experiment with new hardware and software in the cloud.
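As an illustration, a converged AI job can be submitted through the same batch scheduler that handles simulation jobs. The sketch below shows a minimal SLURM batch script for a small training run; the partition name and script path are hypothetical placeholders, and sites using Magpie would instead start from the submission-script templates the Magpie project provides.

```shell
#!/bin/bash
# Hypothetical SLURM batch script submitting an AI training job to the
# same resource manager that schedules simulation and modeling work.
#SBATCH --job-name=dl-training        # job name shown in the queue
#SBATCH --partition=converged         # hypothetical pooled-cluster partition
#SBATCH --nodes=4                     # four workload-optimized nodes
#SBATCH --ntasks-per-node=1          # one training process per node
#SBATCH --time=04:00:00               # wall-clock limit

# Launch one training process per node; srun handles placement, so the
# scheduler, not a human, decides which pooled nodes run the job.
srun python train_model.py --epochs 10
```

Submitted with `sbatch`, the job waits in the same queue as traditional simulation jobs and runs whenever the scheduler finds free nodes in the pool.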
Tying it all together into a unified data environment
To create a unified environment, Intel recommends using the open-source Alluxio* storage abstraction for all pooled clusters.
In short, Alluxio creates a single point of access to data so applications can transparently access data in place, without complex, time-consuming configuration. Eliminating the need to move or duplicate data around the enterprise yields significant performance and efficiency gains.
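As a sketch of how that single point of access can be set up (assuming an Alluxio deployment is already running; the bucket and NFS paths below are hypothetical placeholders), existing data sources are mounted into one Alluxio namespace with the `alluxio fs mount` command:

```shell
# Mount two existing data sources into one Alluxio namespace.
# The object-store bucket and NFS paths are hypothetical placeholders.
alluxio fs mount /training-data  s3://example-bucket/datasets
alluxio fs mount /simulation-out /mnt/nfs/simulation-results

# Applications on any pooled cluster now reach both data sets
# through the same Alluxio paths:
alluxio fs ls /training-data
```

Because the data stays in place in the underlying stores, jobs on different clusters read through the same paths instead of copying data between systems.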
For a more detailed discussion of pre-validated workload-optimized server configurations and their compatibility with selected open-source batch schedulers, read the “Intel Select Solutions for HPC AI Converged with Open-Source Batch Schedulers Solution Brief”.
Explore the Intel Select Solutions for HPC capabilities and performance optimized configurations at https://www.intel.com/selectsolutions.
*Other names and brands may be claimed as the property of others.
[i] Cloud is one component in TACC’s strategy for the Frontera supercomputer, which will be the fastest system at any U.S. university and one of the most powerful supercomputers in the world.