In the previous Cluster Lifecycle Management column, I discussed best practices for choosing the right vendor to build the cluster that meets your needs. Once your team has selected a vendor and finalized the purchase of your new system, the next crucial step is deploying and validating the HPC cluster.
As part of the vendor selection process, I recommended you ask the candidates to include deployment and validation services in their price proposals. Most, if not all, HPC system providers are prepared to install the hardware and software they sell and make sure everything runs correctly. In some cases, they may subcontract some of this work out to experienced HPC professionals.
If you have an HPC-savvy IT department, performing the deployment yourself can save money, but you should give this option considerable thought. In too many situations, I’ve seen an expensive cluster sit idle while an unexpected problem is worked out. This downtime can wipe out any cost savings you hoped to realize by deploying with internal staff.
Settling the deployment responsibilities and signing the final purchase contract does not mean your HPC selection team has completed its work. The same group of internal stakeholders must now turn its attention to the deployment and validation phases.
The first order of business is for the team to prepare the facility that will soon house the cluster. Your team must make sure the chosen site has adequate space to accommodate the hardware. There must also be sufficient, reliable power to run the system and enough air conditioning to keep it cool. It may turn out that the local facility is not adequate; if so, a co-location (co-lo) facility will have to be selected instead. While looking at power, take special care to give the vendor the exact voltage, phase, and plug specifications. Providing the right power connection cables is usually the responsibility of the vendor, but they need the correct specifications in advance. Once the site is ready for installation, deployment can begin.
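The power and cooling check above comes down to simple arithmetic. The sketch below is a rough first-pass estimate only: the equipment counts, per-unit wattages, and 20 percent headroom are hypothetical placeholders, and the real figures should come from the vendor's specification sheets.

```python
# Rough power and cooling budget for a planned cluster.
# Every count and wattage below is a hypothetical placeholder.

WATTS_TO_BTU_PER_HOUR = 3.412   # 1 W of load produces about 3.412 BTU/hr of heat
BTU_PER_HOUR_PER_TON = 12_000   # 1 ton of cooling removes 12,000 BTU/hr


def facility_budget(equipment, headroom=0.2):
    """Return (total_watts, btu_per_hour, cooling_tons) for an equipment list.

    equipment: iterable of (name, count, watts_each) tuples
    headroom:  fractional safety margin added on top of the summed load
    """
    base_watts = sum(count * watts for _, count, watts in equipment)
    total_watts = base_watts * (1 + headroom)
    btu_per_hour = total_watts * WATTS_TO_BTU_PER_HOUR
    cooling_tons = btu_per_hour / BTU_PER_HOUR_PER_TON
    return total_watts, btu_per_hour, cooling_tons


planned = [
    ("compute node", 32, 450),
    ("head node", 1, 600),
    ("storage shelf", 2, 800),
    ("switch", 2, 200),
]

watts, btu, tons = facility_budget(planned)
print(f"Power: {watts / 1000:.1f} kW, heat: {btu:,.0f} BTU/hr, cooling: {tons:.1f} tons")
```

An estimate like this is useful mainly for spotting an obviously undersized room early; it does not replace a review by the facility's electrical and HVAC staff.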
For larger clusters, HPC vendors will typically ‘rack and stack’ the equipment offsite before shipping, which means the racks can simply be rolled into your facility and put in their chosen spots. The racks are then connected to each other and to the power source. An important step during installation is to label equipment and cable connections so they can be readily identified.
The vendor then powers up the hardware to make sure each component is functioning and performs a burn-in. Either onsite or prior to shipping, the vendor confirms the equipment has the most up-to-date supported BIOS and firmware for the various system components.
The next crucial phase of deployment is loading the operating system and software. Most vendors will use a cluster management package to get the operating system and the HPC software stack deployed on the nodes. Specifically, this specialized software ensures compute nodes are set up consistently so they boot properly into the operating system and all contain an identical software stack. If nodes are set up with consistent images and have connectivity to the head node, they all look the same to the applications that will run on them. Using a cluster management package also significantly reduces the time needed to redeploy a node or to roll out software changes.
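One way to spot-check that nodes really do carry an identical software stack is to fingerprint each node's package list and look for outliers. The following is a minimal sketch, not a feature of any particular cluster management package: the node names and package strings are made up, and it assumes the package lists have already been collected from each node by some other means.

```python
import hashlib
from collections import Counter


def stack_fingerprint(packages):
    """Hash a node's package list so identical stacks give identical digests."""
    canonical = "\n".join(sorted(packages))  # order-independent representation
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drifted_nodes(inventory):
    """Return the sorted names of nodes whose stack differs from the majority.

    inventory: dict mapping node name -> list of "package-version" strings
    """
    fingerprints = {node: stack_fingerprint(pkgs) for node, pkgs in inventory.items()}
    if not fingerprints:
        return []
    # Treat the most common fingerprint as the reference image.
    reference, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(node for node, fp in fingerprints.items() if fp != reference)


# Hypothetical inventory: node003 carries a different MPI version.
inventory = {
    "node001": ["kernel-3.10.0", "openmpi-1.6.5", "gcc-4.8.2"],
    "node002": ["gcc-4.8.2", "kernel-3.10.0", "openmpi-1.6.5"],
    "node003": ["kernel-3.10.0", "openmpi-1.6.4", "gcc-4.8.2"],
}
print(find_drifted_nodes(inventory))
```

Comparing against the majority fingerprint is a simplification; if a known-good reference image exists, comparing every node against that is the safer choice.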
The deployment may also involve setting up any external storage that is needed by the cluster. Finally, the appropriate scheduler and applications need to be properly configured and deployed.
The cluster is now ready for basic validation. The vendor runs a suite of software specifically designed to test the nodes and the cluster, usually High Performance LINPACK (HPL). Alternatively, these suites may be manufacturer-specific and shipped with the HPC stacks. For example, Intel offers its own cluster-specific requirements program called Intel Cluster Ready. Online validation applications are also available. In addition, you may request tests designed specifically for your particular use cases and applications.
Typically, the test first validates the nodes are functioning individually and then confirms they are operating together as a cluster. Some of the problems that may be identified during basic validation are memory issues within specific nodes or interconnect errors between nodes. Tools may be included in the suite to test each interconnect and data storage drive.
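The per-node stage of validation often amounts to running the same test on every node and looking for outliers. As a rough illustration, the sketch below flags nodes whose single-node benchmark score falls well below the median; the scores, node names, and 10 percent tolerance are hypothetical, and real validation suites apply their own criteria.

```python
from statistics import median


def flag_slow_nodes(gflops_by_node, tolerance=0.10):
    """Return sorted names of nodes scoring more than `tolerance` below the median.

    gflops_by_node: dict mapping node name -> single-node benchmark score
    tolerance:      allowed fractional shortfall relative to the median
    """
    if not gflops_by_node:
        return []
    typical = median(gflops_by_node.values())
    floor = typical * (1 - tolerance)
    return sorted(node for node, score in gflops_by_node.items() if score < floor)


# Hypothetical single-node results; node003 might have a bad memory module.
results = {
    "node001": 410.0,
    "node002": 405.0,
    "node003": 298.0,
    "node004": 412.0,
}
print(flag_slow_nodes(results))
```

A node flagged this way is only a candidate for investigation; the underlying cause could be memory, thermal throttling, a BIOS setting, or something else entirely.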
Often vendors complete their validation tests at this point, but I recommend additional validation and even benchmarking as part of the process. Once the basic cluster validation tests above have been completed, it is critical to test the application setup by submitting jobs through the scheduler to make sure the application runs on the cluster from end to end. Only at this point can you be confident your new cluster is fully up and running.
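An end-to-end check of this kind can be scripted. The column does not name a scheduler, so the sketch below assumes Slurm purely for illustration; the job name and the trivial `srun hostname` payload are placeholders that would be replaced with a short run of your real application.

```python
import shlex
import subprocess


def build_smoke_test_command(nodes, tasks_per_node=1, payload="srun hostname"):
    """Build an sbatch command that runs a small job across `nodes` nodes.

    Assumes Slurm; `payload` is a placeholder for a real application run.
    """
    return [
        "sbatch",
        "--wait",                               # block until the job terminates
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
        "--job-name=smoke-test",                # hypothetical job name
        f"--wrap={payload}",                    # wrap the command in a batch script
    ]


def run_smoke_test(nodes, tasks_per_node=1):
    """Submit the job and return True if sbatch reports exit status 0."""
    command = build_smoke_test_command(nodes, tasks_per_node)
    print("Submitting:", shlex.join(command))
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0
```

On a login node with Slurm installed, calling `run_smoke_test(4)` would submit the job to four nodes and wait for it to finish. With other schedulers the idea is the same, but the commands and flags differ.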
Some vendors will go the extra mile and perform benchmark tests to determine how efficiently the cluster is operating. An HPL benchmark, for example, measures how quickly the HPC system solves a large, dense system of linear equations, with the result reported in floating-point operations per second. The results serve as a performance baseline for the cluster, and some vendors take advantage of this information to tweak the system (that is, modify various settings) to squeeze greater power and speed from it.
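A common way to read an HPL result is as a fraction of the cluster's theoretical peak. The sketch below shows that arithmetic under stated assumptions: the node count, core count, clock speed, floating-point operations per cycle, and measured score are all hypothetical, and what counts as an acceptable efficiency depends on the hardware and interconnect.

```python
def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Peak rate if every core retired its maximum FLOPs on every cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak achieved by the measured HPL run."""
    return measured_gflops / peak_gflops


# Hypothetical 32-node cluster and a hypothetical measured HPL score.
peak = theoretical_peak_gflops(nodes=32, cores_per_node=16,
                               clock_ghz=2.5, flops_per_cycle=8)
measured = 8192.0
efficiency = hpl_efficiency(measured, peak)
print(f"Peak: {peak:.0f} GFLOPS, measured: {measured:.0f} GFLOPS, "
      f"efficiency: {efficiency:.0%}")
```

Recording this ratio at acceptance time gives a baseline to compare against after later tuning or hardware changes.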
Assuming issues were resolved during the validation and benchmark scores are acceptable, your HPC cluster is now ready for operation.
The total time required for deployment and validation will vary with the size of the system. A smaller cluster of 16 or 32 nodes, for instance, may take a week to become operational, while a 200- to 300-node system may require a month or two, depending on the complexity of the overall configuration and the acceptance test requirements. These times may be shorter if the vendor performs much of the work offsite before delivery and installation at your facility.
Members of your internal IT group should be on hand during deployment and validation for several reasons. They can confirm the process is proceeding as planned, and by observing, without getting in the vendor’s way, they will learn how the equipment is connected. Most vendors are willing to hold a ‘knowledge transfer’ session, but this is best left until deployment and validation are complete so as not to slow progress while work is underway.
Now it’s time to put your new HPC cluster to use and make sure it continues operating efficiently. I’ll cover “Proper Care and Feeding of Your Cluster” in the next column.
Deepak Khosla is president and CEO of X-ISS Inc.