May 3, 2011

A Cluster of Queries for HPC Newcomers

Nicole Hemsoth

As potential HPC users consider making a major hardware investment, persistent questions plague their pre-cluster conversations. Among them is one that sounds simple on the surface yet takes serious consideration: “Do we try to manage this gear in-house or outsource our HPC?”

It is probably not a comforting thought, but as with everything else in high performance computing, the easy answer is, “Well, it all depends.” Still, there are some broad generalities that guide potential users toward the right first questions.

This week HPC services company X-ISS released a white paper for general consumption (reg req’d) that provides a top ten-style checklist of questions to ask before investing in in-house HPC. The Houston-based company specializes in HPC solutions, including cluster design, deployment, management and benchmarking, so yes, they have a vested interest in extolling the benefits of outsourcing. Even so, their points are broad enough to work as a general framework for pre-cluster questioning.

They note that “with the level of complexity involved in an HPC system, having the right person in charge can be the difference between mediocre research and a high productivity machine that is delivering a high return on investment.”

With that in mind, we can skip past some of the more obvious tips (for example: make sure you actually have people who can administer HPC systems, do a cost-benefit analysis, define your goals and how long and for what purposes you need the system, remember that experienced people lead to fewer outages, work out how many people need to be trained, etc.) and move to a few more compelling questions.

One of the ten points is that users need to be aware of what they need out of their reporting and tracking systems. The authors claim that many organizations see “infrequent reports on system usage, downtime, performance and productivity, which in many cases is due to inexperience with reporting tools or over-capacity of in-house staff.” Outsourcing might be able to ramp up reporting, but even if you’re not weighing the outsource-or-not question, it’s worth noting the importance of monitoring and tracking sophistication: how much of it will be needed and how to staff and maintain it.

The X-ISS team also broached the issue of repairs. It is hard to imagine breakdowns before the system is even installed, but beyond citing repairs as a reason to outsource, X-ISS notes that someone needs to be on hand to work with vendors as needed and to make sure replacements are installed and functional.

Another point X-ISS makes is that users will need to consider the issue of productivity: “how fast do you need your system to be running and in production?” It seems like a straightforward question, but there can be hidden factors, such as discovering that you’ll have to hire new people. They note that “often HPC clusters sit idle due to lack of tools and visibility as well as inexperienced cluster management skills in monitoring and improving system utilization.” They encourage factoring these matters into the cost-benefit analysis as well.

Although these questions all relate to the outsourcing decision, they are good points to consider about HPC systems in general for anyone wondering how HPC might play into their operations. As with so many other things, it all seems to boil down to people and expertise. Still, take these at face value as general HPC tips: being armed with potential scenarios is a good exercise in seeing the complexity of owning HPC resources in any capacity.

