How to Deploy and Validate Your Cluster

By Deepak Khosla

January 13, 2015

In the previous Cluster Lifecycle Management column, I discussed best practices for choosing the right vendor to build the cluster that meets your needs. Once your team has selected a vendor and finalized the purchase of your new system, the next crucial step is deploying and validating the HPC cluster.

As part of the vendor selection process, I recommended you ask the candidates to include deployment and validation services in their price proposals. Most, if not all, HPC system providers are prepared to install the hardware and software they sell and make sure everything runs correctly. In some cases, they may subcontract some of this work out to experienced HPC professionals.

If you have an HPC-savvy IT department, performing the deployment yourself can save money, but you should give this option considerable thought. In too many situations, I’ve seen an expensive cluster sit idle while an unexpected problem is worked out. This downtime can wipe out any cost savings you hoped to realize by deploying with internal staff.

Even after the deployment responsibilities have been settled and the final purchase contract has been signed, your HPC selection team’s work is not finished. The same group of internal stakeholders must now turn its attention to the deployment and validation phases.

The first order of business is for the team to prepare the facility that will soon house the cluster. Your team must make sure the chosen site has adequate space to accommodate the hardware. There must also be sufficient, reliable power to run the system and enough air conditioning to keep it cool. It may turn out that the local facility is not adequate; if so, a colocation (co-lo) facility will have to be selected. While looking at power, take special care to provide the exact voltage, phase, and plug specifications to the vendor. Providing the right power connection cables is usually the vendor’s responsibility, but the vendor needs the correct specifications in advance. Once the site is ready for installation, deployment can begin.
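To make the power and cooling check concrete, here is a rough sizing sketch in Python. The per-node wattage, the overhead figure for switches and storage, and the headroom percentage are illustrative assumptions, not vendor numbers; substitute the figures from your vendor's specification sheet. The conversion factors (about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling) are standard.

```python
WATTS_TO_BTU_PER_HOUR = 3.412   # heat produced by 1 W of electrical load
BTU_PER_HOUR_PER_TON = 12000.0  # one "ton" of air-conditioning capacity


def facility_requirements(node_count, watts_per_node,
                          overhead_watts=0.0, headroom=0.20):
    """Estimate the power budget and cooling load for a cluster.

    overhead_watts covers switches, storage, and head nodes (assumed value).
    headroom is a safety margin added on top of the estimated load.
    Returns (power in kW, cooling in BTU/hr, cooling in tons).
    """
    load_watts = node_count * watts_per_node + overhead_watts
    budget_watts = load_watts * (1.0 + headroom)
    btu_per_hour = budget_watts * WATTS_TO_BTU_PER_HOUR
    tons = btu_per_hour / BTU_PER_HOUR_PER_TON
    return budget_watts / 1000.0, btu_per_hour, tons


if __name__ == "__main__":
    # Hypothetical 32-node cluster drawing 500 W per node.
    kw, btu, tons = facility_requirements(32, 500, overhead_watts=2000,
                                          headroom=0.25)
    print(f"Power budget: {kw:.1f} kW")
    print(f"Cooling load: {btu:.0f} BTU/hr ({tons:.1f} tons)")
```

An estimate like this is only a starting point for the conversation with your facilities staff and the vendor; it does not replace the exact voltage, phase, and plug details the vendor needs.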

For larger clusters, HPC vendors will typically ‘rack and stack’ the systems offsite before shipping, which means the racks can simply be rolled into your facility and put in their chosen spots. The racks will then be connected to each other and to the power source. An important step during installation is to label equipment and cable connections so they can be readily identified.

The vendor then powers up the hardware to make sure each component is functioning and performs a burn-in. Either onsite or prior to shipping, the vendor confirms the equipment has the most up-to-date supported BIOS and firmware for the various system components.
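If you want to double-check the firmware step yourself, a small script can compare what each node reports against the versions the vendor says are supported. The sketch below assumes you have already collected the versions into a dictionary (how you gather them depends on your hardware and tools) and that versions are plain dotted numbers such as 2.10.0; the node names, component names, and version numbers are all hypothetical.

```python
def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(inventory, baseline):
    """List components that are missing or older than the baseline.

    inventory: {node name: {component: reported version}}
    baseline:  {component: minimum supported version}
    Returns a list of (node, component, reported, required) tuples.
    """
    problems = []
    for node in sorted(inventory):
        for component, required in sorted(baseline.items()):
            reported = inventory[node].get(component)
            if (reported is None
                    or parse_version(reported) < parse_version(required)):
                problems.append((node, component, reported, required))
    return problems


if __name__ == "__main__":
    baseline = {"bios": "2.10.0", "bmc": "1.4"}  # hypothetical versions
    inventory = {
        "node01": {"bios": "2.10.0", "bmc": "1.4"},
        "node02": {"bios": "2.9.3", "bmc": "1.4"},
    }
    for node, component, reported, required in find_outdated(inventory,
                                                             baseline):
        print(f"{node}: {component} is {reported}, expected {required}")
```

Comparing versions as tuples of integers matters here: as plain strings, "2.9.3" would sort after "2.10.0" and the stale BIOS would go unnoticed.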

The next crucial phase of deployment is loading the operating system and software. Most vendors will use a cluster management system package to get the operating system and the HPC software stack deployed on the nodes. Specifically, this specialized software ensures compute nodes are set up consistently so they start properly from the operating system and all contain an identical software stack. If nodes are set up with consistent images and have connectivity to the head node, they all look the same to the applications that will run on them. It also significantly reduces the time to redeploy a node if needed or to deploy changes to the software.
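Cluster management packages handle image consistency for you, but the idea is easy to illustrate. The following sketch assumes you have gathered each node's installed package list by some means of your own; it fingerprints each list and reports the nodes that differ from the most common image. The node and package names are made up for the example.

```python
import hashlib
from collections import Counter


def fingerprint(packages):
    """Hash a package list so two nodes can be compared at a glance."""
    joined = "\n".join(sorted(packages))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_drifted_nodes(node_packages):
    """Return the nodes whose software stack differs from the majority.

    node_packages: {node name: list of 'name-version' strings}
    """
    prints = {node: fingerprint(pkgs)
              for node, pkgs in node_packages.items()}
    if not prints:
        return []
    # Treat the most common fingerprint as the reference image.
    reference, _count = Counter(prints.values()).most_common(1)[0]
    return sorted(node for node, fp in prints.items() if fp != reference)


if __name__ == "__main__":
    nodes = {
        "node01": ["openmpi-4.1.5", "gcc-12.2.0"],
        "node02": ["gcc-12.2.0", "openmpi-4.1.5"],  # same set, other order
        "node03": ["openmpi-4.1.4", "gcc-12.2.0"],  # older MPI build
    }
    print("Nodes that differ from the majority:", find_drifted_nodes(nodes))
```

Using the majority image as the reference is a design choice for this sketch; in practice you would more likely compare against the golden image your cluster manager deploys.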

The deployment may also involve setting up any external storage that is needed by the cluster. Finally, the appropriate scheduler and applications need to be properly configured and deployed.

The cluster is now ready for basic validation. The vendor runs a suite of software specifically designed to test the nodes and the cluster, usually High Performance LINPACK (HPL). Alternatively, these suites may be manufacturer-specific and shipped with the HPC stacks. For example, Intel offers its own cluster-specific requirements program called Intel Cluster Ready. Online validation applications are also available. In addition, you may request tests designed specifically for your particular use cases and applications.

Typically, the test first validates the nodes are functioning individually and then confirms they are operating together as a cluster. Some of the problems that may be identified during basic validation are memory issues within specific nodes or interconnect errors between nodes. Tools may be included in the suite to test each interconnect and data storage drive.
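One simple way to spot a node with a memory or interconnect problem is to run the same single-node benchmark everywhere and look for outliers. The sketch below assumes you already have one result per node in GFLOPS; the 10 percent tolerance and the sample numbers are illustrative assumptions, and a real acceptance test would define its own threshold.

```python
from statistics import median


def flag_slow_nodes(gflops_by_node, tolerance=0.10):
    """Return nodes scoring more than `tolerance` below the median.

    gflops_by_node: {node name: measured single-node result in GFLOPS}
    """
    if not gflops_by_node:
        return []
    typical = median(gflops_by_node.values())
    floor = typical * (1.0 - tolerance)
    return sorted(node for node, score in gflops_by_node.items()
                  if score < floor)


if __name__ == "__main__":
    # Hypothetical per-node results from the same benchmark run.
    results = {"node01": 500.0, "node02": 505.0,
               "node03": 498.0, "node04": 410.0}
    print("Nodes to investigate:", flag_slow_nodes(results))
```

The median is used instead of the mean so that one badly behaved node does not drag down the reference value it is being compared against.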

Often vendors complete their validation tests at this point. But I recommend additional validation and even benchmarking as part of the process. Once the basic cluster validation tests above have been completed, it is critical that the application setup be tested by having it submit jobs through the scheduler to make sure it will run on the cluster from end to end. Only at this point can you be confident your new cluster is fully up and running.
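As an illustration of an end-to-end check, the sketch below generates a short batch script for a smoke-test job. It assumes the Slurm scheduler, which this column does not specify; other schedulers use different directives. The job name, node counts, and the application command are hypothetical placeholders for your own application.

```python
def build_smoke_test_script(job_name, nodes, tasks_per_node, command,
                            minutes=10):
    """Return the text of a Slurm batch script that runs `command` once.

    Assumes Slurm; `command` is whatever launches your application.
    """
    if not 1 <= minutes <= 59:
        raise ValueError("minutes must be between 1 and 59")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time=00:{minutes:02d}:00",
        "#SBATCH --output=smoke-%j.out",  # %j expands to the job ID
        "",
        "set -euo pipefail",  # stop at the first failing command
        f"srun {command}",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # "./my_app --input sample.dat" is a placeholder command.
    print(build_smoke_test_script("smoke-test", nodes=4, tasks_per_node=16,
                                  command="./my_app --input sample.dat"))
```

Saving the output to a file and submitting it with sbatch exercises the whole path: the scheduler accepts the job, allocates nodes, launches the application across them, and writes results back to shared storage.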

Some vendors will go the extra mile and perform benchmark tests to determine how efficiently the cluster is operating. An HPL benchmark, for example, measures how quickly the HPC system solves a large, dense system of linear equations and reports the result as a floating-point rate. The results serve as a performance baseline for the cluster, and some vendors use this information to tune the system (modifying various settings) to squeeze greater power and speed from it.
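A common way to read an HPL result is to compare the measured rate (often called Rmax) with the theoretical peak (Rpeak) of the hardware. The sketch below shows the arithmetic; the node count, core count, clock speed, and floating-point operations per cycle are hypothetical and depend on your processors, so take the real values from the vendor's specifications.

```python
def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz,
                            flops_per_cycle):
    """Rpeak in GFLOPS: every core retiring its maximum FLOPs each cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that the HPL run achieved."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    # Hypothetical cluster: 16 nodes, 32 cores each, 2.5 GHz,
    # 16 double-precision FLOPs per core per cycle.
    rpeak = theoretical_peak_gflops(16, 32, 2.5, 16)
    rmax = 15360.0  # hypothetical measured HPL result in GFLOPS
    print(f"Rpeak: {rpeak:.0f} GFLOPS")
    print(f"Efficiency: {hpl_efficiency(rmax, rpeak):.0%}")
```

Recording this ratio at acceptance time gives you a baseline to rerun against after firmware updates, hardware replacements, or configuration changes.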

Assuming issues were resolved during the validation and benchmark scores are acceptable, your HPC cluster is now ready for operation.

The total time required for deployment and validation will vary with the size of the system. A smaller cluster of 16 or 32 nodes, for instance, may take a week to become operational, while a 200- to 300-node system may require a month or two, depending on the complexity of the overall configuration and the acceptance test requirements. These times may be shorter if the vendor performs much of the work offsite before delivery and installation at your facility.

Members of your internal IT group should be on hand during deployment and validation for several reasons. Ensuring the process proceeds as planned is important, and by observing the work, without getting in the vendor’s way, the team will learn how the equipment is connected. Most vendors are willing to hold a ‘knowledge transfer’ session, but this is best left until deployment and validation are complete so as not to slow progress while work is underway.

Now it’s time to put your new HPC cluster to use and make sure it continues operating efficiently. I’ll cover “Proper Care and Feeding of Your Cluster” in the next column.

Deepak Khosla is president and CEO of X-ISS Inc.

