For many organizations, an HPC cluster ranks among their most expensive and complex IT investments. From acquisition to replacement, these systems must be proactively designed, managed, and operated within the context of a lifecycle, one that is distinct from that of most other IT assets.
To maximize ROI in a rapidly changing market, a pragmatic approach is necessary. This column series will walk through cluster lifecycle management, from an initial needs assessment through design, vendor selection, deployment, and maintenance, and finally to replacement.
So you’ve decided to join the supercomputing club. Now what? Ideally, the first order of business is a needs assessment to determine exactly what you require in terms of HPC technology.
Put simply, understand the intended use of the cluster and the expected return to the business. Understanding the cluster’s intended use cases will allow you to select a cluster of the right size, with the right architecture and components. Buy a cluster that is too small to meet demands, and you have a money pit requiring unanticipated expenditures for years to come. Buy one that’s bigger than you need, and you’ve wasted money on an investment with a poor ROI. The management you worked so hard to convince that HPC was just the solution to your problems isn’t likely to forget either scenario.
A Needs Assessment starts with an understanding of the application(s) and types of computations to be executed on the cluster. If you have HPC experts on staff, have them analyze how personnel will use the cluster, what the workflow will be, and what software applications will be deployed. In that case, your HPC experts will work with the application vendors or developers to determine the ideal number and type of nodes and the most appropriate operating system. If you don’t have HPC experts on staff, you will need to rely on the hardware and software vendors, working with your cluster users and business units, to understand user requirements and system demands.
More specifically, your HPC experts or the vendors will want to know the typical usage, or performance, profile of each application within your organization.
- How many users access the application at any given time?
- Is the application part of a larger workflow?
- How will users launch the application from their workstations?
These questions are key to scoping the design of the compute cluster (types of nodes and interconnects) to meet the intended usage, performance, and operational objectives, and to maximizing the cluster’s overall effectiveness at parallel processing.
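As an illustration of how answers to these questions feed into scoping, a back-of-the-envelope calculation can convert a usage profile into a rough node count. Every figure below is a hypothetical placeholder, not a recommendation; substitute your own profiling data:

```python
# Rough, illustrative cluster-sizing sketch. All inputs are assumptions.
concurrent_users = 20      # peak simultaneous users (assumption)
cores_per_job = 64         # typical cores per parallel job (assumption)
cores_per_node = 32        # cores in a candidate compute node (assumption)
headroom = 1.25            # 25% slack for scheduling overhead (assumption)

peak_cores = concurrent_users * cores_per_job * headroom
nodes_needed = -(-peak_cores // cores_per_node)   # ceiling division

print(f"Peak core demand: {peak_cores:.0f}")
print(f"Compute nodes required: {nodes_needed:.0f}")
```

Even a crude estimate like this gives vendors and stakeholders a shared starting point for discussing node types and counts.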
Next consider data storage capacity and I/O performance, which will be determined in part by how the users access data for input and output of the computations.
- Where does the input data reside, and how will users retrieve output data and visualize results?
- How do users plan to share their work with others?
- How big are the input and output files?
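Answers to these questions can likewise be turned into a rough capacity estimate. The numbers below are hypothetical assumptions chosen only to show the arithmetic:

```python
# Illustrative storage-capacity estimate; all inputs are assumptions,
# not measured values.
jobs_per_day = 200         # expected daily job count (assumption)
input_gb_per_job = 2.0     # average input size in GB (assumption)
output_gb_per_job = 10.0   # average output size in GB (assumption)
retention_days = 90        # how long results are kept (assumption)

daily_growth_gb = jobs_per_day * (input_gb_per_job + output_gb_per_job)
capacity_tb = daily_growth_gb * retention_days / 1024

print(f"Daily data growth: {daily_growth_gb:.0f} GB")
print(f"Capacity for {retention_days}-day retention: {capacity_tb:.1f} TB")
```

A similar calculation against job durations would yield the sustained I/O bandwidth the storage system must deliver.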
In addition to initial cluster demand requirements, consideration should be given in the Needs Assessment to future use and growth of the cluster. It is important to look at the business and technology roadmap associated with the cluster to see how its use may expand over time. The organization may end up with twice as many cluster users, or new applications may be added, each requiring additional cluster resources and capabilities such as compute capacity, memory, storage, bandwidth, or high availability.
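One simple way to reason about such growth is a compound projection over the cluster's planned service life. The growth rate and lifetime below are hypothetical assumptions:

```python
# Hypothetical demand-growth projection over the cluster's service life.
current_nodes = 50         # initial node count (assumption)
annual_growth = 0.20       # 20% yearly growth in demand (assumption)
lifetime_years = 4         # planned service life (assumption)

projected = current_nodes * (1 + annual_growth) ** lifetime_years
print(f"Projected node demand in {lifetime_years} years: {projected:.0f}")
```

If the projection roughly doubles demand, that argues for extra rack space, power, and network ports on day one, even if the nodes come later.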
Once the cluster hardware and software decisions have been made, the more practical consideration of cluster management must be addressed. Infrastructure comes first. Your organization must ask whether it has the space, with the proper cooling and electrical infrastructure, to install the system. As with compute capacity, physical space should be evaluated with future expansion in mind.
Will the cluster require high availability (HA)? HA is all about ensuring the system remains available to users even if a major component of the cluster fails. This means building in redundancy: duplicating critical components such as head or master nodes, switches, storage, and network fabrics. Depending on the level of redundancy desired, implementing HA can have a significant impact on cost, design, configuration, management complexity, and infrastructure.
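One way to scope an HA design is to inventory the critical components and flag any that lack a redundant counterpart. A toy sketch, with an entirely illustrative inventory rather than a real cluster configuration:

```python
# Toy single-point-of-failure check. The component names and counts
# below are illustrative assumptions only.
inventory = {
    "head_node": 2,           # duplicated master nodes
    "core_switch": 2,         # redundant fabric switches
    "storage_controller": 2,  # dual controllers
    "login_node": 1,          # only one -- a single point of failure
}

single_points = [name for name, count in inventory.items() if count < 2]
print("Single points of failure:", single_points or "none")
```

Walking such a list with vendors helps decide which redundancies justify their cost and which failure modes the organization can tolerate.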
Next, consider the array of cluster-specific tools, applications, and services the cluster will require. Somehow, all of the hardware and software components that make up the cluster must operate in concert. Some services are required for the cluster to operate; others are optional. Similarly, some cluster tools are a must, others a convenience. Services and tools vary, and there are many options to choose from. Deciding on preferences and capabilities is an important exercise for the cluster’s users and managers.
An organization with a typical IT staff may be disappointed to learn that those individuals lack the expertise to care for and maintain an HPC cluster. The organization must consider hiring new personnel or outsourcing management activities, which will likely involve continuous monitoring, proactive cluster management, and analyzing and reporting on the system’s operations. Given the painful alternatives, staffing and services for managing your cluster must be part of your Needs Assessment.
In this series of columns we will shed further light on many of these issues and components of HPC clusters. In the next column, we will discuss how the results of your Needs Assessment can be matched to an HPC cluster design that works for you. Specifically, we will address how to select the right node types, interconnects, operating system, and data storage based on your needs.