by Tom Woolf
San Diego, CA — A battle is brewing in high-performance computing (HPC): which approach provides better computing performance, cache-coherent shared-memory systems or compute clusters? As with most things, the answer is not a simple one and can best be summed up as, “It depends.”
Shared-memory systems are those “big-iron” computers with multiple processors working together to deliver lots of data-crunching horsepower. The fact that these systems are cache-coherent and present a single-system image (SSI) is important, since fast access to a large memory reserve is what gives them their power. Clusters, on the other hand, are arrays of standalone systems connected by high-speed networks; the user must break computing jobs into separate tasks that can be spread across the systems in the array using a programming model called message passing. Clusters can be scaled to handle larger and larger data processing jobs by adding more nodes. Each approach has its advantages and limitations, depending on the application.
Clustering has been gaining momentum as it becomes more cost/performance competitive. According to Derek Robb, product manager for the next-generation scalable node system based on the Intel Itanium microprocessor, “The cost of computer components is dropping, as is the cost of the interconnect, which means the price/performance of compute clusters is more attractive. In addition, more users and independent software vendors are writing applications that accommodate message passing in their programming models so they will run on both clusters and shared-memory systems, and they are changing their algorithms to adapt to a distributed memory model.”
However, clustering is not a magic bullet for HPC. As Ken Jacobsen, director of applications for SGI, notes, “Clustering works best for applications such as animation programs where you render hundreds of similar images to generate film frames.” Shared-memory applications, on the other hand, are designed to draw from a single memory pool and usually don’t have message-passing capability, so they can’t run in a clustering environment. In practice this means that for large computing projects, such as aerodynamic modeling or fluid dynamics, shared-memory offers a better alternative.
According to Jacobsen, the rule is that if you write your own application code, then you are in a better position to rewrite it in a message-passing model. In practice, this means clustering is better suited to scientific applications, where scientists customize their applications or use open source, as opposed to manufacturing environments where companies use off-the-shelf software written for the lowest common platform.
As Ben Passarelli, director of marketing for scalable servers, explains it, the customer who wants to focus on “science rather than computer science” often prefers shared-memory systems.
Jacobsen adds that while more third-party developers are starting to add message-passing structures to their applications, most commercial developers still need to accommodate multiple operating systems, which makes clustering support difficult. Rewriting commercial applications to add message passing is not trivial, so commercially viable clustering software will be slow in coming.
What this means for SGI is that the company will continue to offer both solutions to customers. In fact, clustering shared-memory systems opens up new market possibilities.
“We know that reducing memory access time is desirable, but when is it better to share everything or segment memory to process different jobs? Both solutions are important to different kinds of customers, and since we are in the business of meeting the needs of the technical community, we will continue to supply both solutions,” said Passarelli. “In fact, both architectures are converging. In the not-too-distant future, SGI software and hardware will be able to bring together the best of both worlds into a single HPC platform.”
The Yin and Yang of HPC: A Debate on the Pros and Cons of Capability Clusters and Cache-Coherent Shared-Memory Systems.
The preceding discussion provides a high-level, uncomplicated, and therefore wholly inadequate view of the debate that is raging as to the “right” way to approach high-performance computing. Like Macintosh versus IBM or rocky road versus tutti-frutti, this debate can have metaphysical ramifications in certain quarters. The following discussion is offered as a more complete view of the debate for the technically uninitiated, to raise awareness as to the whys and wherefores of cache-coherent shared-memory systems and capacity and capability clusters, just in case the subject should arise at your next encounter at the coffee machine.
A cluster is a parallel or distributed system in which a collection of separate computers is interconnected to form a single, unified computing resource. There are two basic kinds of clusters: a capacity (or throughput) cluster, where different jobs are run batch-style on different systems, and a capability cluster, which uses multiple systems together to address a single huge computing problem. In a capability cluster, information is shared among nodes using a message-passing protocol over high-speed links such as HIPPI (high-performance parallel interface), GSN™ (Gigabyte System Network), Gigabit Ethernet, Myrinet, or Giganet.
The objective behind clustering is to take large computing jobs and break them into smaller tasks that can run and communicate effectively across multiple systems. Proponents point to three advantages: clusters have a lower initial cost, they can be scaled to large numbers of processors, and, because processing is spread among multiple systems, they present no single point of failure.
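The scatter/gather pattern behind a capability cluster can be sketched in miniature. A real cluster would use MPI over a fast interconnect such as those named above; as a stand-in, this illustrative sketch uses Python threads and queues, with each thread playing the role of a node and each queue the role of a network link. The function names are hypothetical.

```python
import threading
import queue

def node(inbox, outbox):
    """A worker 'node': receive one task message, compute, send the result back."""
    chunk = inbox.get()                       # message in: one slice of the job
    outbox.put(sum(x * x for x in chunk))     # message out: the partial result

def cluster_sum_of_squares(data, n_nodes=4):
    """Break one large job into n_nodes tasks, scatter them, gather the results."""
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    outbox = queue.Queue()
    workers = [threading.Thread(target=node, args=(ib, outbox)) for ib in inboxes]
    for w in workers:
        w.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::n_nodes])           # scatter: round-robin the data
    total = sum(outbox.get() for _ in workers)  # gather the partial sums
    for w in workers:
        w.join()
    return total
```

The essential point is that each node sees only the slice of data it was sent; on a real cluster, that slice crosses the interconnect, which is where latency enters the picture.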
Much of SGI’s recent development and marketing efforts have focused on the price/performance offered by computer clustering. When measured in dollars per megaflop, the cost of proprietary computing hardware continues to drop even as the power of less expensive commodity hardware continues to increase. As a result, the price/performance of clustering has improved substantially, making it an attractive HPC approach for many SGI customers.
SGI recently announced the Advanced Cluster Environment (ACE), which offers an economical clustering solution for both IRIX/MIPS and Linux/IA platforms, leveraging the SGI™ 2100 and SGI™ 2200 midrange computer systems for cost-effective, compute-intensive applications. The SGI IRIX ACE software is designed to complement the SGI 2100 and SGI 2200 midrange servers and draws on SGI’s experience developing implementations such as the 1,536-processor cluster for the National Center for Supercomputing Applications (NCSA) and the 6,144-processor cluster for Los Alamos National Labs (LANL). The new SGI™ 1200 and SGI™ 1400 product lines make clustering even more affordable. The pending release of new server products built on the Itanium™ processor will push price/performance further still by leveraging high-volume commodity components from Intel, as the costs of processors drop, interconnect bandwidth increases, and the associated latency continues to fall.
“It would be ideal if, instead of a cluster, you could use an SSI shared-memory system of any size you want,” says Robb, “but that’s not economically or technologically feasible – you can’t scale the operating system to thousands of processors.” Robb adds that whereas the practical limit today for shared-memory systems is 128 processors (although a few 256-processor systems have been developed for special applications), clusters can continue to scale up as needed.
Robb indicates that as computing platforms and high-performance interconnects become less expensive, clustering becomes even more attractive for HPC applications. In terms of processing costs, a Linux cluster today runs about $5 per megaflop, as opposed to hundreds of dollars per megaflop just a few years ago. With more software development work being done on Linux for clustered nodes built on Intel Pentium (IA-32) and IA-64 processors, costs will continue to drop, making clustering even more attractive for HPC applications.
According to Jacobsen, cache-coherent, shared-memory systems, i.e., computer systems where multiple processors are configured in the same machine with a single memory resource, often deliver superior computing performance because they minimize latency, the lag time created by passing data from one point to another for processing.
“We once performed a test where we ran the same Fluent program on a cluster of four machines with four processors each and a 16-processor single-system image machine,” Jacobsen says. “We found that the 16-processor SSI machine gave performance superior to the clustered systems. The only difference was latency.” The close proximity of processors in the same machine, sharing the same memory, speeds performance because it minimizes latency.
In addition to latency, Jacobsen argues that the total cost of computing is dramatically lower with a shared-memory solution when matched processor for processor. Consider, for example, the cost of administering a cluster of four shared-memory machines with 64 CPUs per machine, as opposed to administering 64 machines with 4 CPUs per machine. The total number of processors is 256 in either configuration, but it is clearly easier to manage and troubleshoot four interconnected systems than 64.
Cache-coherent shared-memory applications are also easier to engineer, since they draw from common memory; clustered applications must use MPI (the message passing interface) to coordinate the data exchange between nodes. MPI serves as the traffic cop that keeps track of the data, which makes the task of pointing to the data more complicated for the programmer. An application with message passing built into its architecture can readily run in either a clustering environment or on a shared-memory system. Applications written for a shared-memory system, however, typically do not incorporate message passing and will run only on shared-memory systems.
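The programming-model difference can be seen in a toy example. In the shared-memory style, every worker indexes directly into one common array; there are no sends or receives to coordinate. This is an illustrative Python sketch (a real shared-memory code would more likely use OpenMP threads in C or Fortran, and Python threads here illustrate only the model, not actual parallel speedup); the function names are hypothetical.

```python
import threading

def shared_memory_sum(data, n_workers=4):
    """Shared-memory style: all workers read the *same* array directly and
    deposit partial results into a common buffer -- no messages, no copies."""
    partial = [0] * n_workers             # one slot per worker, in shared memory

    def worker(rank):
        # Each worker strides through the common data array in place;
        # nothing is packaged up and sent anywhere.
        partial[rank] = sum(data[i] for i in range(rank, len(data), n_workers))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)
```

Compare this with a message-passing version, where each worker would first have to receive its slice of the data explicitly; that extra choreography is precisely what makes porting shared-memory applications to clusters non-trivial.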
To highlight the pros and cons of clustered and shared-memory computing, let’s consider a market segment that has become important to SGI – automotive engineering. In the automotive world there are applications that can be categorized as “embarrassingly parallel,” such as running crash test simulations on the same auto body design using minor variations. For this application, a clustered system is practical, since each simulation can be run on a different node using slightly different parameters. However, other computer-aided engineering applications must run within a fixed time frame using off-the-shelf applications and are better suited to shared-memory systems to keep to the production schedule. Few commercial applications have MPI built in to take advantage of message passing in a clustered environment, so the fallback computing platform has to be a shared memory system.
Both Robb and Jacobsen agree that SGI customers ultimately will embrace both architectures, deploying shared-memory systems into a larger clustered infrastructure. As Jacobsen notes, an architecture with fewer clustered machines minimizes latency and administration, but by putting shared-memory systems in a compute cluster, you have the best of both worlds – a scalable HPC architecture. Robb adds, “We have to embrace both architectures and make intelligent choices about how to combine them to meet customers’ changing needs.”
Adds Passarelli, “Our customers look to SGI to deliver cost-competitive hardware that has no limits on scalability, is easy to administer, and can be integrated into a single comprehensive solution. They want computing performance without having to worry about the underlying configuration. That’s why SGI is actively working to bring together shared-memory systems and clustering into a single platform. We are committed to meeting the high-performance computing needs for all of our customers, and to do that, we need to continue to actively expand the technology for both shared-memory systems and clustered computing.”
So there is no right way or wrong way to approach HPC. Rocky road or tutti-frutti, clustering or shared-memory systems, Linux or IRIX, or an HPC sundae that includes a little bit of everything – customers can always pick the computing combination to suit their taste.