Managing clusters can be a daunting task
Higher education and research institutes around the globe are investing in HPC clusters, yet one oversight is all too common during the acquisition process: they are not investing in the additional, dedicated manpower it takes to maintain, monitor, and update their clusters. To the grad students and post-docs who need the clusters for their research, and who also end up being the ones in charge of maintaining and fixing them, clusters are just big, black boxes. Updating the OFED drivers or troubleshooting MCE errors can too often seem like an impossible task.
Learn to leverage the deep technical insight of qualified cluster vendors willing to work with you
Damien Tourret, PhD, is a postdoctoral fellow at the Center for Interdisciplinary Research on Complex Systems at Northeastern University’s Boston campus. Recently, he was asked how the group’s cluster has been operating and how his team, led by Distinguished Professor Alain Karma, PhD, has managed to keep it running without a cluster administrator. Here is what Damien said in an interview with a server technology specialist:
Interviewer: Tell me about yourself and how you use the cluster.
Tourret: Yes, I’m a user of this cluster. But Ari Adland, other grad students, and I are basically also in charge of it. We installed it, and thanks to the interaction we’ve had with our cluster vendor we just basically plugged it in. We’re not computer scientists and we’re not HPC experts. We know how to use HPC, but we’re not that knowledgeable about managing the cluster. Those in charge of cluster maintenance are just grad students and postdocs with backgrounds in physics, materials science, and biomedical research. We want to spend our time on the cluster doing research, not cluster management.
Our relationship and interaction with our cluster vendor has allowed us to run our own cluster here while focusing on our research. Basically, we just plugged our cluster in and exchanged a few emails here and there with the vendor to solve problems. Whether it’s GPUs or CPUs, what matters to us is the speedup we can achieve in solving our problems, more than considerations about the architecture. To most users here, clusters are big, black boxes.
Interviewer: Have you had any specific cluster problems?
Tourret: A few problems. One was related to a memory slot failure, and we had to have the motherboard replaced on one of the nodes at some point. Two DIMM slots were just not being seen by the motherboard, and the node couldn’t run jobs properly. That’s a problem we could not diagnose ourselves.
When we reached that level of problem, our cluster provider quickly responded to our needs. Whether it’s a big problem with the motherboard or even a small update to the software, when we can’t diagnose something ourselves, we’re not able to fix it the way they can. We simply email Bart, one of our vendor’s HPC engineers, and our problem gets diagnosed and fixed within the day. That’s what we appreciate most.
Interviewer: Have you worked with other system integrators, and how was their support?
Tourret: I have not. This is the first time I’ve personally been in charge of a cluster. I’ve used them before, but they were managed by the IT people. Honestly, when I first arrived here, I was told “OK, you’re going to install this and that” and I asked myself, can I do that? That’s where our cluster vendor has helped a lot, because I could not have done it on my own.
At one point we had a performance problem we could not understand at all. We explained the problem to our provider. They felt our code needed more memory channels for better performance and suggested a new configuration with more memory modules, which they took care of. That fixed our problem and improved the memory access and speed of our MPI jobs.
Over the past year and a half, the questions we had were like, “Hey, how do I fix this?” and the answer was “Try running this command line,” or a response equally simple. And even if I don’t know what that command means, it solved the problem quickly and easily most of the time. What more do you need?
Interviewer: Who is your cluster vendor, and is there anything more you would like to say about them?
Tourret: Atipa Technologies is our vendor. I appreciate the collaboration with them so far. Something that I really appreciated was being able to benchmark our programs with them before buying. For instance, when we wanted to try a new GPU or CPU we could just SSH in and test our real codes before making the purchase. When it’s possible, trying before you buy helps guide you to the best purchase decision.
A word from Atipa Technologies…
Damien Tourret is not the only one. Thousands of cluster end-users are struggling to keep clusters running on their own. Tourret is, however, one of the lucky ones. Because he had the foresight to purchase from a vendor who can provide priceless support and advice, rather than from an OEM with multiple support levels that would take days or even weeks to return a solution, his cluster has been running efficiently and reliably since the day it was delivered. In the end, what matters most to researchers is how quickly their simulations can be solved, even if the machines doing the work are mysterious, big, black boxes.
To find out more about Atipa Technologies solutions, go to www.atipa.com
For questions or to request a quote, contact Dan Mantyla at email@example.com