Taming the Wild Cluster

By Merry Maisel, TACC Science Writer

April 1, 2005

For years, I've written about computational scientific and engineering work carried out on machines called “clusters.” I was aware that clusters offered significant price/performance advantages for many highly parallel computations. I also knew that many of today's “supercomputers” can also be called clusters.

Recently, I realized that, beyond what I've written above, I actually knew relatively little about clusters. I'm a science writer—the kind of person forever ashamed of his or her vast ignorance of everything—so I decided to learn a bit more. Not surprisingly, it turned out that two of my colleagues where I work, at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin (UT Austin), knew quite a lot about clusters. Each had built and managed several clusters, all differing in design, components, size, and purpose.

If you are a scientist or engineer or (maybe more to the point) a graduate student becoming a scientist or engineer, my hope is that this interview with my two knowledge-laden colleagues, looking at clusters from the level of the computer-room floor, will be helpful to you. The questions are mine, and the answers come from Drs. Karl Schulz and William Barth, both members of the High-Performance Computing Group at TACC. Schulz obtained a Ph.D. in aerospace engineering from UT Austin in 1999, and Barth did likewise in 2004.

Maisel: What appears to happen is that a science or engineering professor grabs a graduate student and says, “Here is X thousand dollars from the big grant. Please build me a cluster so we can all do our research on it.” Is that the way it goes?

Barth and Schulz (in chorus): That's how I got started.

Maisel: So what is a cluster?

Schulz: Just a group of machines treated as one. I think you just have to pick a definition and run with it. Lately, people have been using the term “commodity-based cluster.” You could argue that all big supercomputers have always been “clusters” of smaller pieces.

Barth: The Cray T3E could have been considered a cluster, in fact. It just had a really nice interconnect and a very nice package. The central processor chip was basically a commodity processor, an Alpha with a few changes made by Cray.

Maisel: Tell me how many actual clusters each of you has worked on.

Barth: I worked in the CFDlab at UT, led by Professor Graham Carey, and while I was a grad student, we built three clusters—one in 1997, one in 1998, and one in 2000. I was not fully on the TACC staff when Lonestar, our 1024-Xeon-processor Dell cluster, was built, but I've been involved in the maintenance on Lonestar, and now I'm working on our newest Dell cluster, Wrangler, which has Intel 64-bit Xeon processors.

Schulz: I built my first cluster in 1996. Our lab got involved with them after MPI came out as a standard, and we were just hooking up scientific workstations. Ours were IBM RS/6000s. Then we switched over to Linux boxes when we realized the price/performance advantage. After I got my degree I spent some time in the commercial software world, where we were doing the same thing: we all had Linux boxes and we always needed to chain them together, so we always had some form of Linux cluster. Then I joined TACC and was involved with the deployment of Lonestar, and now Wrangler.

Maisel: Are the boxes in a cluster all alike?

Barth: Pretty much—usually some multiple-of-two number of machines, plus one log-in node. That one is usually a bit more powerful, with more disk or maybe some more RAM. But all the rest are usually identical, which makes it easier to build.

Schulz: The log-in node or head node is usually configured with a superset of what is on the other nodes. It may be the place where the batch scheduling system runs; it allows the users to log in and compile codes. Whether compilers are installed on the rest of the nodes is compiler-dependent—some of them have libraries that need to be on all nodes, so the compilers may have to be installed there.

Our approach at TACC is to treat all the nodes as independent entities, and we chain them together with multiple interconnects. One is usually an expensive, high-speed interconnect—Myrinet, Infiniband, Quadrics, or something like that—and we usually have some other network as well, such as gigabit Ethernet. But each node has a separate operating system and some minimum set of software, so the users can actually get on and do something productive. So one of the issues that arises—the need to “tame” these clusters—is that so many compute nodes can easily get out of sync with each other. It's very difficult to diagnose what's wrong if a user submits a job and only one node out of 500 is missing the particular library that he or she needs. It isn't always obvious why the job is failing. So there is a real management problem in just keeping everything in sync, which is why there are so many different cluster toolkits.
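
To make that diagnosis problem concrete, here is a minimal sketch (not TACC's actual tooling) of sweeping every compute node for a required library over ssh. The hostnames, node count, and library path are hypothetical, and the script assumes passwordless ssh to each node.

```python
# Minimal sketch: find compute nodes missing a shared library.
# Hypothetical hostnames (compute-001 ... compute-500) and library path.
import subprocess

NODES = [f"compute-{i:03d}" for i in range(1, 501)]
LIBRARY = "/usr/lib/libexample.so"  # hypothetical library a user's job needs

missing = []
for node in NODES:
    # 'test -e' exits nonzero if the file is absent (or the node is unreachable).
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", node, "test", "-e", LIBRARY],
        capture_output=True,
    )
    if result.returncode != 0:
        missing.append(node)

print(f"{len(missing)} of {len(NODES)} nodes are missing {LIBRARY}:")
for node in missing:
    print(" ", node)
```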

Maisel: Such as Rocks?

Schulz: NPACI Rocks is one of them, indeed. We have been running various flavors of Rocks at TACC. It sits on the head node, and when we install a new node, it talks to the master node and says, “hey, give me the distribution,” and it gets a copy of the same software that is on all the other nodes. Typically, on a small cluster—and you can argue about what is actually “small” these days—all the nodes will be identical.

Barth: But once your cluster starts to get big, say, up in the thousands of processors, you're generally going to have different entities. You'll have compute nodes, on which users will do scientific analysis, and you may also have others that only do some administrative task, such as I/O or accounting for running jobs. In the case of Lonestar and Wrangler, for example, we have a parallel file system product, Ibrix, which attaches a separate set of nodes to a high-speed, high-performance disk system networked to the compute nodes and login node.

Maisel: Clearly, a “supercomputer” cluster is more complicated, then, than the kind you used to build. What were those like?

Barth: The first machine we built in the computational fluid dynamics lab had 16 nodes and a separate log-in node, with extra disk and more memory, and they were all connected on a single Fast Ethernet network. We ran two separate codes on it, and we did all the operating system installations by hand, from a CD. Actually, we installed one, got it set up the way we wanted it, then cloned the disk, and then went from two systems to four, eight, and all sixteen, geometrically and in parallel. Then we made a handful of changes on each node to give it a different identity, a different IP address and name. That was a pain, but we only did it once, and we had so few users that we didn't need a queuing system.

Maisel: So you just yelled when you wanted the machine?

Barth: Shouted across the room. It wasn't until our third system that we had enough users to justify a batch queuing system, so we used PBS. We were still only running two or three different codes, but we had half a dozen or so users. By that time we had taken all the disks out of the compute nodes and they were getting files over the network from the log-in node, which had a lot of disk space in it. That's a scalable solution out to about 32 nodes, but after that you'd need something more distributed.
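
For readers who have never touched a batch system, the sketch below shows roughly what handing a job to a PBS-style scheduler involves, wrapped in a small Python script. The resource requests, walltime, and "./my_app" are hypothetical, and directive details vary among PBS flavors (OpenPBS, Torque, PBS Pro).

```python
# Hedged sketch of submitting an MPI job through a PBS-style batch system.
# Node/processor counts and './my_app' are made up for illustration.
import subprocess

job_script = """#!/bin/sh
#PBS -N example_run
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
mpirun -np 8 ./my_app
"""

with open("example.pbs", "w") as f:
    f.write(job_script)

# qsub hands the script to the scheduler, which queues it until nodes free up.
subprocess.run(["qsub", "example.pbs"], check=True)
```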

Schulz: My experience was much the same. We both came from engineering backgrounds, and such students get sucked into this quite easily, because there isn't a lot of formal training. You end up with a cluster because of the great price/performance ratio. As long as you're only building a small cluster, it's really fairly tractable. You're forced into learning things about keeping them all up to date. A lab system is easy to manage at 16 or 32 nodes. I've never done a 64-node system, but it probably isn't that bad. Once the cluster gets any bigger, there will be special challenges.

Maisel: How should the size of a cluster relate to the number of users or number of codes being employed?

Schulz: Usually the size is determined by the amount of money available: you want to buy the biggest system you can afford. Commercial companies might buy them for only one or two users, or for only one application.

Barth: There are certainly oil companies and the like that have a 2000-processor cluster running only one code—just six or eight engineers in a group might be working with the same code, day in and day out, usually seismic inverse modeling—trying to figure out what the ground is made of by listening to it.

Maisel: So what advice would you give to the grad student who's asked to build a cluster today?

Schulz: Run! Run while you can, go out and get a real job—you'll be stuck doing clusters forever! If that's not an option, though, I'd advise the research group to think about the kind of arrangement we've made here with Wrangler. That cluster is shared by a group in the Astronomy Department at UT Austin, the UT Institute of Geology, and the Bureau of Economic Geology.

Barth: I'd advise anyone with a sizable cluster at least to manage it with a toolkit like Rocks. Clusters in labs tend to get customized for their very small user base, and they're often not upgraded very much. At TACC, on the other hand, we need to maintain a very generic system, and upgrades are much more frequent.

Schulz: There are at least two major problems that come with upgrades.  One is that, with a large commodity cluster, you have multiple vendors involved. Maybe the nodes come from one source, a high-speed interconnect from another, a batch system from a third. If it's a really large cluster, you'll need some sort of parallel file system, from yet another vendor. Anytime you have that many vendors, one difficulty is simply getting them to cooperate with each other. The second problem occurs if you're at the edge—constantly building larger and larger clusters. Invariably, there are scalability problems with every single thing you add, whether that's a batch system, a switch, or a parallel file system.

Barth: At this point, TACC has a larger Dell cluster than Dell does on its own floor, bigger machines than Platform or Ibrix have in either of their test facilities. So we end up becoming the beta site for many of the vendors.

Maisel: So when should a computational scientist build a cluster and when should he or she go in on a cluster with others or have a large cluster managed by an IT department or a place like TACC, an advanced computing center?

Schulz: Unless a group is running a proprietary code or doing research on clusters per se, they're best advised to share a cluster and get it more professionally managed, I think. For one thing, they can use a larger machine by going in with other users. No one can argue about the price/performance advantage of clusters, but what you get out of them is what you pay for. And if you factor in the cost to deploy and manage clusters, it's still very attractive, though most small groups overlook that cost.

Barth: It isn't easy to manage a really large cluster because there are so many components that can fail. In a machine with a thousand processors, the mean time between failures for some items is going to be on the order of a week. Such a machine needs a dedicated hardware staff. If something has a 300,000-hour lifetime or mean-time-to-failure and you have a thousand of them, then one will fail every 300 hours, which is 12 or 13 days. And that's the amount of time many of these things are rated for—disk drives, power supplies, and so on.
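
Barth's arithmetic is worth spelling out: with many identical parts, the expected interval between failures somewhere in the machine is roughly the per-part mean-time-to-failure divided by the number of parts. A quick check of the figures he quotes:

```python
# Rough failure-interval estimate: with n identical components, expect a
# failure somewhere in the machine about every (MTTF / n) hours.
mttf_hours = 300_000   # rated mean-time-to-failure of one disk or power supply
n_components = 1_000   # how many of them the cluster contains

interval_hours = mttf_hours / n_components
print(f"One failure roughly every {interval_hours:.0f} hours "
      f"(about {interval_hours / 24:.1f} days)")
# Prints: One failure roughly every 300 hours (about 12.5 days)
```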

Maisel: Speaking of power supplies, what about power and cooling issues?

Barth: That's definitely a consideration. Our CFD lab had to have power pulled for the last system we built.

Schulz: And you can't go much above 16 or 32 nodes in a closet or rack. After that, you might need a cooled room, dedicated power, a raised floor. As clock speeds on the chips go up, the amount of heat we have to deal with increases exponentially. We see this at TACC all the time.

Barth: In fact, one advantage I can see of having a smaller system in your own lab versus going in on a larger system in a place like TACC might be the speed of your connection to the machine. But that's something that is being rapidly overcome as more universities install fast networks. Another advantage of the smaller lab cluster might be the access level—

Maisel: Meaning?

Schulz: You can kill your buddy's job if you're more important!

Barth: Precisely. You can directly prioritize tasks. If something needs to be run for a paper due this week and someone else is toying around with some problem, you can kick them off until your runs are done. On the other hand, we do try to accommodate such priorities at TACC. We can make reservations for things, but we need to know in advance.

Maisel: What is the barrier to on-demand computing in a large cluster?

Barth: The major barrier is the lack, on the hardware and OS side, of reasonable facilities for checkpointing and restarting codes. The only really good checkpoint/restart facilities in my memory were on the Cray T3E. The operating system could do this without even talking to the running application. It could just take a snapshot of its memory, and it knew about any messages that were in flight on the network. You could literally migrate jobs around in the 3D torus topology of the Cray network. If some area of the machine was having a problem, you could migrate running jobs: checkpoint them, move them to another area, and restart them. The runtime was a little longer than it would have been otherwise, but you could take a misbehaving portion of a machine out of the loop without killing jobs.

Schulz: Nothing runs that way on the commodity clusters. We certainly hope checkpoint/restart will be solved in the cluster world at some point; that lack is one of the big downsides to large clusters. Since lots of things will fail, someone's job will die with it, and at the moment, such a job is not recoverable. Of course, in the scientific community, because people are used to this, most people checkpoint their own code—dump a file now and then so everything is not lost when something on the cluster fails. That's rudimentary checkpointing, though—nothing like monitoring what messages are in flight. There are just too many players involved in the cluster community, too many vendors: with the chip produced here, the operating system there, the interconnect by someone else—no wonder checkpoint/restart hasn't been available!

Barth: And no wonder Cray could do it. Cray wrote the OS, wrote the MPI, wrote the other communication libraries; they built the interconnect and the processors and tested them all together and developed them all together. Whereas you have any number of Linux vendors and any number of MPI flavors for clusters.
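
The application-level checkpointing Schulz describes, dumping a file now and then so a crash does not erase everything, usually looks something like the following sketch. The state layout, step counts, and file name are hypothetical.

```python
# Minimal application-level checkpointing: periodically dump the solver state
# so a failed run can resume from its last checkpoint instead of step zero.
import os
import pickle

CHECKPOINT = "checkpoint.pkl"
CHECKPOINT_EVERY = 100               # steps between dumps (hypothetical)
TOTAL_STEPS = 10_000

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)    # resume after a failure
    return {"step": 0, "solution": 0.0}  # fresh start

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

state = load_state()
while state["step"] < TOTAL_STEPS:
    state["solution"] += 1.0e-3      # stand-in for one solver iteration
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save_state(state)            # lose at most 100 steps to a crash
```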

Maisel: So there's a lot to think about when deciding on buying a cluster?

Schulz: Should you buy your own or should you seek professional help? Wrangler has buy-in from groups of academic researchers. Why would they share one when they can buy their own and do whatever they want with it? But it is actually pretty attractive to go in with your compatriots to buy a bigger machine. Suppose you have X dollars and you go get the largest cluster you can afford, say 16 processors. At max, then, you can run a 16-processor job 24 hours a day. But if you go in with others and buy a larger cluster, you can run a bigger job and get it turned around faster.

Add to that the fact that, in the academic research world, you and your research group are not always in the stage of running production computing jobs. You're developing new algorithms or trying new ideas, so you aren't running all the time. If it's your own cluster, that time is just gone, evaporated. You never get it back. But if you're in a pool with some fixed amount of CPU time on a larger cluster, you can actually get more throughput and waste fewer hours.
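
Schulz's turnaround argument can be put in rough numbers. Under the idealized assumption of perfect parallel scaling, the sketch below compares a fixed workload on a privately owned 16-processor cluster with a burst on 64 processors of a shared machine; the figures are hypothetical.

```python
# Idealized turnaround comparison: private 16-processor cluster versus a
# 64-processor burst on a shared machine. Assumes perfect parallel scaling,
# which real codes only approximate.
work_cpu_hours = 1_600   # total work in the job (hypothetical)

own_procs = 16
pooled_procs = 64

own_turnaround = work_cpu_hours / own_procs        # 100 hours of waiting
pooled_turnaround = work_cpu_hours / pooled_procs  # 25 hours of waiting

print(f"Own 16-processor cluster: {own_turnaround:.0f} hours to results")
print(f"64 processors of a pool:  {pooled_turnaround:.0f} hours to results")
```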

Barth: Wrangler does have a big advantage in better memory access. Each processor in a node on Wrangler has access to the entire 4 gigabyte node memory, and the frontside bus runs at 800 MHz, while the Lonestar bus is 533 MHz. For jobs that are memory-bandwidth-limited, that could work out to a 33 percent faster run on Wrangler.
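
The 33 percent figure follows from the bus speeds: if a job is limited purely by memory bandwidth, its runtime scales roughly with the inverse of the frontside-bus frequency, so this is a back-of-the-envelope check rather than a benchmark.

```python
# If a run is purely memory-bandwidth-bound, runtime scales roughly with
# 1 / (frontside-bus frequency). Compare the 533 MHz and 800 MHz buses.
lonestar_mhz = 533
wrangler_mhz = 800

runtime_ratio = lonestar_mhz / wrangler_mhz   # ~0.67 of the old runtime
print(f"Wrangler runtime is about {runtime_ratio:.2f} times Lonestar's, "
      f"i.e., roughly {(1 - runtime_ratio) * 100:.0f} percent faster")
```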

Schulz: The only problem is that big clusters require some expertise. Bill and I got involved as graduate students, sort of forced into it, which is how most people pick up these skills. But there's some threshold, as you go from 16 to 1,000 processors, where student system management makes no sense at all. If you pool your money with other groups and someone like TACC helps administer the machine, a large cluster can be a big win for all the researchers and all the students, too. Stuff does go wrong, and it may not be obvious that there's a bad line card in a switch or that the bisection bandwidth is a tenth of what it should be. No lights will blink, no e-mail will be sent to you, and it will take a professional to troubleshoot.

Maisel: Is the ability to troubleshoot, this kind of tacit knowledge, something that can be taught formally in engineering or computer science?

Barth: Not nearly enough, I think. At TACC, we have taught a half-day beginning cluster installation course as part of our Introduction to Parallel Programming course, which we give regularly, several times a year. We do know, from our consulting work, that graduate students just beginning to use clusters do not yet have the skills to be effective users.

Schulz: Not because they aren't intelligent, but because there's no class they can take that gives them the Unix skills. You can get training for the Cray X1 and get certified; I've even heard of general-purpose Linux certification, but I don't know of a general, multi-vendor cluster certification. People start like we did, with a 16-processor cluster. Then they think, wow, 16 is cool, but think what we could do with 64! You learn, and you get burned; we do even now, and we've done it a long time. Clusters are the workingman's supercomputer. Is it an interconnect issue or a disk problem? Like an old hot rod enthusiast, you just go at it. It has an awesome engine, but it needs a brake job pretty frequently.

Barth: A student has to avoid becoming a system administrator. Heck, I barely got my research done and my thesis written. A student who decides to improve the department's computational capability, or who is asked to do so, will definitely hear this: “Stop doing that! You need to do your research! Stop working on the system and do your research!”

Maisel: “But how can I do my research if the system isn't up?”

Barth: Exactly! I think that's the bottom line here. The world of advanced computation changes all the time, and it takes an advanced center to stay ahead of the curve because the expertise that such centers have is their most valuable asset. We sure think so here, and we're focusing all of that on getting the job done for researchers at UT, nationwide on the TeraGrid because we're a partner in that, and worldwide through our international alliances.


Merry joined TACC as a Science Writer in October 2003. She speaks with users of TACC resources and summarizes their scientific and computational work for the benefit of other users and a scientifically literate public. Merry worked as a science writer at the San Diego Supercomputer Center (SDSC) for 18 years before coming to TACC.

For more on Barth and Schulz, including pictures, visit http://www.tacc.utexas.edu/general/staff/

To learn even more about clusters, look into attending the 6th LCI International Conference on Linux Clusters: The HPC Revolution 2005.  More info can be found at http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/
