For years, I've written about computational scientific and engineering work carried out on machines called “clusters.” I was aware that clusters offered significant price/performance advantages for many highly parallel computations. I also knew that many of today's “supercomputers” can fairly be called clusters.
Recently, I realized that, beyond what I've written above, I actually knew relatively little about clusters. I'm a science writer, the kind of person forever ashamed of his or her vast ignorance of everything, so I decided to learn a bit more. Not surprisingly, it turned out that two of my colleagues where I work, at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin (UT Austin), knew quite a lot about clusters. Each of them had built and managed several, all differing in design, components, size, and purpose.
If you are a scientist or engineer or (maybe more to the point) a graduate student becoming a scientist or engineer, my hope is that this interview with my two knowledge-laden colleagues, looking at clusters from the level of the computer-room floor, will be helpful to you. The questions are mine, and the answers come from Drs. Karl Schulz and William Barth, both members of the High-Performance Computing Group at TACC. Schulz obtained a Ph.D. in aerospace engineering from UT Austin in 1999, and Barth did likewise in 2004.
Maisel: What appears to happen is that a science or engineering professor grabs a graduate student and says, “Here is X thousand dollars from the big grant. Please build me a cluster so we can all do our research on it.” Is that the way it goes?
Barth and Schulz (in chorus): That's how I got started.
Maisel: So what is a cluster?
Schulz: Just a group of machines treated as one. I think you just have to pick a definition and run with it. Lately, people have been using the term “commodity-based cluster.” You could argue that all big supercomputers have always been “clusters” of smaller pieces.
Barth: The Cray T3E could have been considered a cluster, in fact. It just had a really nice interconnect and a very nice package. The central processor chip was basically a commodity processor, an Alpha with a few changes made by Cray.
Maisel: Tell me how many actual clusters each of you has worked on.
Barth: I worked in the CFD lab at UT, led by Professor Graham Carey, and while I was a grad student, we built three clusters: one in 1997, one in 1998, and one in 2000. I was not fully on the TACC staff when Lonestar, our 1024-Xeon-processor Dell cluster, was built, but I've been involved in the maintenance on Lonestar, and now I'm working on our newest Dell cluster, Wrangler, which has Intel 64-bit Xeon processors.
Schulz: I built my first cluster in 1996. Our lab got involved with them after MPI came out as a standard, and we were just hooking up scientific workstations. Ours were IBM RS/6000s. Then we switched over to Linux boxes when we realized the price/performance advantage. After I got my degree I spent some time in the commercial software world, where we were doing the same thing: we all had Linux boxes and we always needed to chain them together, so we always had some form of Linux cluster. Then I joined TACC and was involved with the deployment of Lonestar, and now Wrangler.
Maisel: Are the boxes in a cluster all alike?
Barth: Pretty much: usually some multiple-of-two number of machines, plus one log-in node. That one is usually a bit more powerful, with more disk or maybe some more RAM. But all the rest are usually identical, which makes it easier to build.
Schulz: The log-in node or head node is usually configured with a superset of what is on the other nodes. It may be the place where the batch scheduling system runs; it allows the users to log in and compile codes. Whether compilers are installed on the rest of the nodes is compiler-dependent: some of them have libraries that need to be on all nodes, so the compilers may have to be installed there.
Our approach at TACC is to treat all the nodes as independent entities, and we chain them together with multiple interconnects. One is usually an expensive, high-speed interconnect (Myrinet, Infiniband, Quadrics, or something like that), and we usually have some other network as well, such as gigabit Ethernet. But each node has a separate operating system and some minimum set of software, so the users can actually get on and do something productive. So one of the issues that arises, the need to “tame” these clusters, is that so many compute nodes can easily get out of sync with each other. It's very difficult to diagnose what's wrong if a user submits a job and only one node out of 500 is missing the particular library that he or she needs. It isn't always obvious why the job is failing. So there is a real management problem in just keeping everything in sync, which is why there are so many different cluster toolkits.
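The diagnosis problem Schulz describes, one node out of 500 quietly missing a library, boils down to comparing per-node software inventories against a required set. Here is a minimal, self-contained sketch of that comparison; the node names and library file names are hypothetical, and in practice a cluster toolkit would gather the real inventories from the nodes themselves.

```python
# Sketch: spotting the one node out of 500 that has drifted out of sync.
# Node names and library names are hypothetical, for illustration only.

def find_drifted_nodes(inventories, required):
    """Return {node: missing_items} for every node lacking required software."""
    drift = {}
    for node, installed in inventories.items():
        missing = required - installed
        if missing:
            drift[node] = missing
    return drift

required = {"libmpi.so", "libblas.so", "libfftw.so"}
# Pretend all 500 compute nodes are identical...
inventories = {f"compute-{i}": set(required) for i in range(500)}
# ...except one that missed an update.
inventories["compute-317"] = {"libmpi.so", "libblas.so"}

print(find_drifted_nodes(inventories, required))
# {'compute-317': {'libfftw.so'}}
```

A set difference per node is all the logic needed; the hard part in real life, as the interview notes, is reliably collecting the inventories from hundreds of machines in the first place.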
Maisel: Such as Rocks?
Schulz: NPACI Rocks is one of them, indeed. We have been running various flavors of Rocks at TACC. It sits on the head node, and when we install a new node, it talks to the master node and says, “hey, give me the distribution,” and it gets a copy of the same software that is on all the other nodes. Typically, on a small cluster (and you can argue about what is actually “small” these days) all the nodes will be identical.
Barth: But once your cluster starts to get big, say, up in the thousands of processors, you're generally going to have different entities. You'll have compute nodes, on which users will do scientific analysis, and you may also have others that only do some administrative task, such as I/O or accounting for running jobs. In the case of Lonestar and Wrangler, for example, we have a parallel file system product, Ibrix, which attaches a separate set of nodes to a high-speed, high-performance disk system networked to the compute nodes and login node.
Maisel: Clearly, a “supercomputer” cluster is more complicated, then, than the kind you used to build. What were those like?
Barth: The first machine we built in the computational fluid dynamics lab had 16 nodes and a separate log-in node, with extra disk and more memory, and they were all connected on a single Fast Ethernet network. We ran two separate codes on it, and we did all the operating system installations by hand, from a CD. Actually, we installed one, got it set up the way we wanted it, then cloned the disk, and then went from two systems to four, eight, and all sixteen, geometrically and in parallel. Then we made a handful of changes on each node to give it a different identity, a different IP address and name. That was a pain, but we only did it once, and we had so few users that we didn't need a queuing system.
Maisel: So you just yelled when you wanted the machine?
Barth: Shouted across the room. It wasn't until our third system that we had enough users to justify a batch queuing system, so we used PBS. We were still only running two or three different codes, but we had half a dozen or so users. By that time we had taken all the disks out of the compute nodes and they were getting files over the network from the log-in node, which had a lot of disk space in it. That's a scalable solution out to about 32 nodes, but after that you'd need something more distributed.
Schulz: My experience was much the same. We both came from engineering backgrounds, and such students get sucked into this quite easily, because there isn't a lot of formal training. You end up with a cluster because of the great price/performance ratio. As long as you're only building a small cluster, it's really fairly tractable. You're forced into learning things about keeping them all up to date. A lab system is easy to manage at 16 or 32 nodes. I've never done a 64-node system, but it probably isn't that bad. Once the cluster gets any bigger, there will be special challenges.
Maisel: How should the size of a cluster relate to the number of users or number of codes being employed?
Schulz: Usually the size is determined by the amount of money available: you want to buy the biggest system you can afford. Commercial companies might buy them for only one or two users, or for only one application.
Barth: There are certainly oil companies and the like that have a 2000-processor cluster running only one code. Just six or eight engineers in a group might be working with the same code, day in and day out, usually seismic inverse modeling: trying to figure out what the ground is made of by listening to it.
Maisel: So what advice would you give to the grad student who's asked to build a cluster today?
Schulz: Run! Run while you can, go out and get a real job; you'll be stuck doing clusters forever! If that's not an option, though, I'd advise the research group to think about the kind of arrangement we've made here with Wrangler. That cluster is shared by a group in the Astronomy Department at UT Austin, the UT Institute of Geology, and the Bureau of Economic Geology.
Barth: I'd advise anyone with a sizable cluster at least to manage it with a toolkit like Rocks. Clusters in labs tend to get customized for their very small user base, and they're often not upgraded very much. At TACC, on the other hand, we need to maintain a very generic system, and upgrades are much more frequent.
Schulz: There are at least two major problems that come with upgrades. One is that, with a large commodity cluster, you have multiple vendors involved. Maybe the nodes come from one source, a high-speed interconnect from another, a batch system from a third. If it's a really large cluster, you'll need some sort of parallel file system, from yet another vendor. Anytime you have that many vendors, one difficulty is simply getting them to cooperate with each other. The second problem occurs if you're at the edgeconstantly building larger and larger clusters. Invariably, there are scalability problems with every single thing you add, whether that's a batch system, a switch, or a parallel file system.
Barth: At this point, TACC has a larger Dell cluster than Dell does on its own floor, bigger machines than Platform or Ibrix have in either of their test facilities. So we end up becoming the beta site for many of the vendors.
Maisel: So when should a computational scientist build a cluster and when should he or she go in on a cluster with others or have a large cluster managed by an IT department or a place like TACC, an advanced computing center?
Schulz: Unless a group is running a proprietary code or doing research on clusters per se, they're best advised to share a cluster and get it more professionally managed, I think. For one thing, they can use a larger machine by going in with other users. No one can argue about the price/performance advantage of clusters, but what you get out of them is what you pay for them. And if you factor in the cost to deploy and manage clusters, it's still very attractive, but most small groups overlook that cost.
Barth: It isn't easy to manage a really large cluster because there are so many components that can fail. The mean-time-to-failure of a machine with a thousand processors is going to be on the order of once a week for some items. Such a machine needs a dedicated hardware staff. If something has a 300,000-hour lifetime or mean-time-to-failure and you have a thousand of them, then one will fail every 300 hours, which is 12 or 13 days. And that's the amount of time many of these things are rated for: disk drives, power supplies, and so on.
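Barth's arithmetic can be checked in a couple of lines: with n independent components, the aggregate failure rate is roughly n times the individual rate, so the expected gap between failures shrinks by a factor of n. This is a back-of-the-envelope sketch, not a full reliability model.

```python
# Back-of-the-envelope version of Barth's failure arithmetic:
# 1000 components, each rated at 300,000 hours mean time to failure.

def expected_hours_between_failures(mtbf_hours, n_components):
    # With n independent components, failures arrive n times as often,
    # so the expected gap between failures is mtbf / n.
    return mtbf_hours / n_components

gap = expected_hours_between_failures(300_000, 1000)
print(gap, gap / 24)  # 300 hours, i.e. 12.5 days, matching "12 or 13 days"
```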
Maisel: Speaking of power supplies, what about power and cooling issues?
Barth: That's definitely a consideration. Our CFD lab had to have power pulled for the last system we built.
Schulz: And you can't go much above 16 or 32 nodes in a closet or rack. After that, you might need a cooled room, dedicated power, a raised floor. As clock speeds on the chips go up, the amount of heat we have to deal with increases exponentially. We see this at TACC all the time.
Barth: In fact, one advantage I can see of having a smaller system in your own lab, versus going in on a larger system in a place like TACC, might be the speed of your connection to the machine. But that's something that is being rapidly overcome as more universities install fast networks. Another advantage of the smaller lab cluster might be the access level…
Maisel: Meaning?
Schulz: You can kill your buddy's job if you're more important!
Barth: Precisely. You can directly prioritize tasks. If something needs to be run for a paper due this week and someone else is toying around with some problem, you can kick them off until your runs are done. On the other hand, we do try to accommodate such priorities at TACC. We can make reservations for things, but we need to know in advance.
Maisel: What is the barrier to on-demand computing in a large cluster?
Barth: The major barrier is the lack, on the hardware and OS side, of reasonable facilities for checkpointing and restarting codes. The only really good checkpoint/restart facilities in my memory were on the Cray T3E. The operating system could do this without even talking to the running application. It could just take a snapshot of its memory, and it knew about any messages that were in flight on the network. You could literally migrate jobs around in the 3D torus topology of the Cray network. If some area of the machine was having a problem, you could migrate running jobs: checkpoint them, move them to another area, and restart them. The runtime was a little longer than it would have been otherwise, but you could take a misbehaving portion of a machine out of the loop without killing jobs.
Schulz: Nothing runs that way on the commodity clusters. We certainly hope checkpoint/restart will be solved in the cluster world at some point; that lack is one of the big downsides to large clusters. Since lots of things will fail, someone's job will die along with them, and at the moment, such a job is not recoverable. Of course, in the scientific community, because people are used to this, most people checkpoint their own code: dump a file now and then so everything is not lost when something on the cluster fails. That's rudimentary checkpointing, though, nothing like monitoring what messages are in flight. There are just too many players involved in the cluster community, too many vendors: with the chip produced here, the operating system there, the interconnect by someone else, no wonder checkpoint/restart hasn't been available!
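The “dump a file now and then” pattern Schulz describes can be sketched in a few lines: periodically serialize the job's state to disk, and on startup resume from the last dump if one exists. The file name, state layout, and checkpoint interval here are arbitrary choices for illustration, not any particular code's scheme.

```python
# Sketch of application-level "dump a file now and then" checkpointing.
# The file name, state fields, and interval are illustrative choices.
import json
import os

CHECKPOINT = "state.json"

def run(total_steps, checkpoint_every=100):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "result": 0.0}

    while state["step"] < total_steps:
        state["result"] += state["step"] * 0.5   # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Write-then-rename would be safer against a crash mid-write.
            with open(CHECKPOINT, "w") as f:
                json.dump(state, f)
    return state
```

If a node dies mid-run, only the work since the last dump is lost; rerunning the same command picks up from the checkpoint file. This is exactly the application-level recovery the interview contrasts with the T3E's system-level snapshots, which also captured in-flight messages.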
Barth: And no wonder Cray could do it. Cray wrote the OS, wrote the MPI, wrote the other communication libraries; they built the interconnect and the processors and tested them all together and developed them all together. Whereas you have any number of Linux vendors and any number of MPI flavors for clusters.
Maisel: So there's a lot to think about when deciding on buying a cluster?
Schulz: Should you buy your own or should you seek professional help? Wrangler has buy-in from groups of academic researchers. Why would they share one when they can buy their own and do whatever they want with it? But it is actually pretty attractive to go in with your compatriots to buy a bigger machine. Suppose you have X dollars and you go get the largest cluster you can afford, say 16 processors. At max, then, you can run a 16-processor job 24 hours a day. But if you go in with others and buy a larger cluster, you can run a bigger job and get it turned around faster.
Add to that the fact that, in the academic research world, you and your research group are not always in the stage of running production computing jobs. You're developing new algorithms or trying new ideas, so you aren't running all the time. If it's your own cluster, that time is just gone, evaporated. You never get it back. But if you're in a pool with some fixed amount of CPU time on a larger cluster, you can actually get more throughput and waste fewer hours.
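Schulz's pooling argument can be made concrete with some illustrative arithmetic. The 40 percent utilization figure below is a hypothetical number for a bursty research group, not a measurement from the interview; the point is only that idle hours on a dedicated machine evaporate, while a share of a pooled machine can be consumed whenever jobs are ready.

```python
# Illustrative arithmetic for pooling vs. owning a small cluster.
# The 40% utilization figure is a hypothetical assumption.

DEDICATED_PROCS = 16
HOURS_PER_WEEK = 168
utilization = 0.40  # fraction of time the group actually has jobs to run

# Alone: the cluster sits idle between bursts, and those hours are lost.
alone = DEDICATED_PROCS * HOURS_PER_WEEK * utilization

# Pooled: four such groups merge budgets into a 64-processor machine and
# each holds a quarter of its processor-hours. A scheduler lets each group
# spend its allotment whenever its jobs are ready, so little goes unused.
pooled_share = (4 * DEDICATED_PROCS) * HOURS_PER_WEEK / 4

print(alone, pooled_share)  # dedicated vs. pooled processor-hours per week
```

Under these assumptions the pooled share delivers well over twice the usable processor-hours, and the group can also run 64-way jobs it could never fit on its own 16 processors.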
Barth: Wrangler does have a big advantage in better memory access. Each processor in a node on Wrangler has access to the entire 4 gigabyte node memory, and the frontside bus runs at 800 MHz, while the Lonestar bus is 533 MHz. For jobs that are memory-bandwidth-limited, that could work out to a 33 percent faster run on Wrangler.
Schulz: The only problem is that big clusters require some expertise. Bill and I got involved as graduate students (we were sort of forced into it), which is how most people pick up these skills. But there's some threshold, as you go from 16 to 1000 processors, past which student system management makes no sense at all. If you pool your money with another group and someone like TACC, for example, helps administer it, a large cluster can be a big win for all the researchers and all the students, too. Stuff does go wrong, and it may not be obvious that there's a bad line card in a switch or that the bisection bandwidth is a tenth of what it should be; no lights will blink, no e-mail will be sent to you, and it will take a professional to troubleshoot.
Maisel: Is the ability to troubleshoot, this kind of tacit knowledge, something that can be taught formally in engineering or computer science?
Barth: Not nearly enough, I think. At TACC, we have taught a half-day beginning cluster installation course as part of our Introduction to Parallel Programming course, which we give regularly, several times a year. We do know, from our consulting work, that graduate students just beginning to use clusters do not yet have the skills to be effective users.
Schulz: Not because they're not very intelligent, but because there's no class they can take that gives them the Unix skills. You can get training for the Cray X-1 and get certified; I've even heard of general-purpose Linux certification, but I don't know of a general, multi-vendor cluster certification. People start like we did, with a 16-processor cluster. Then they think, wow, 16 is cool, but think what we could do with 64! You learn, and you get burned; we do even now, and we've done it a long time. Clusters are the workingman's supercomputer. Is it an interconnect issue or a disk problem? Like an old hot rod enthusiast, you just go at it. It has an awesome engine, but it needs a brake job pretty frequently.
Barth: A student has to avoid becoming a system administrator. Heck, I barely got my research done and my thesis written. A student who decides to improve the department's computational capability, or who is asked to do so, will definitely hear this: “Stop doing that! You need to do your research! Stop working on the system and do your research!”
Maisel: “But how can I do my research if the system isn't up?”
Barth: Exactly! I think that's the bottom line here. The world of advanced computation changes all the time, and it takes an advanced center to stay ahead of the curve because the expertise that such centers have is their most valuable asset. We sure think so here, and we're focusing all of that on getting the job done for researchers at UT, nationwide on the TeraGrid because we're a partner in that, and worldwide through our international alliances.
Merry joined TACC as a Science Writer in October 2003. She speaks with users of TACC resources and summarizes what they do scientifically and computationally, for the benefit of other users and a scientifically literate public. Merry worked as a science writer at the San Diego Supercomputer Center (SDSC) for 18 years before coming to TACC.
For more on Barth and Schulz, including pictures, visit http://www.tacc.utexas.edu/general/staff/
To learn even more about clusters, look into attending the 6th LCI International Conference on Linux Clusters: The HPC Revolution 2005. More info can be found at http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/