by Alan Beck, editor in chief, LIVEwire
Dallas, Texas — Thomas Sterling holds a joint appointment with NASA’s Jet Propulsion Laboratory (JPL) and the California Institute of Technology (Caltech), serving as Principal Scientist in JPL’s High Performance Computing group and Faculty Associate in Caltech’s Center for Advanced Computing Research. He received his Ph.D. as a Hertz Fellow from MIT in 1984.
For the last 20 years, Sterling has engaged in applied research in parallel processing hardware and software systems for high-performance computing. He was a developer of the Concert shared-memory multiprocessor, the YARC static dataflow computer, and the Associative Template Dataflow computer concept, and has conducted extensive studies of distributed shared-memory cache-coherent systems. In 1994, Sterling led the NASA Goddard Space Flight Center team that developed the first Beowulf-class PC clusters. Since 1994, he has been a leader in the national Petaflops initiative. He is the Principal Investigator for the interdisciplinary Hybrid Technology Multithreaded (HTMT) architecture research project sponsored by NASA, NSA, NSF, and DARPA, which involves a collaboration of more than a dozen cooperating research institutions. Dr. Sterling holds six patents, and was one of the winners of the 1997 Gordon Bell Prize for Price/Performance.
Sterling gave a state of the field talk on COTS Cluster Systems for High-Performance Computing at SC2000; HPCwire talked with him to obtain a better perspective on his views:
HPCwire: Your work in clustered supercomputing has literally revolutionized HPC in the last few years. But surely there is a limit to what is possible for this type of technology — or is there? What are the most serious factors currently circumscribing the capabilities of clustered HPC? Are any solutions on the horizon?
STERLING: The rate of growth in numbers, scale, and diversity of the implementation and application of clusters in HPC, including (but not limited to) Beowulf-class systems, has been extraordinary. But my work with Don Becker on the early Beowulf systems succeeded in no small part because of much previous and continuing good work accomplished by many others in the distributed computing community in hardware and software systems. Workstation clusters (e.g. COW, NOW), message-passing libraries (e.g. PVM, MPI), operating systems (e.g. BSD, Linux), middleware (e.g. Condor, Maui Scheduler, PBS, the Scyld scalable cluster distribution), and advanced networking (e.g. Myrinet, QSW, cLAN) are only a few examples of the ideas, experiences, and components that contributed to the synthesis of Beowulf-class PC clusters and continue to push cluster computing forward at an accelerating rate. And driving all of that enabling technology are the computational scientists adapting their distributed application algorithms to the not always friendly operational properties of successive generations of Beowulf platforms.
It has been a pleasure to play a role in the Beowulf phenomenon, but it is the accomplishment of many, not just a few. Many government organizations have contributed to this, including a number of NASA and DOE labs, with valuable tools disseminated to the community as open source software by some of them (e.g. Argonne, Oak Ridge, Goddard, Ames). And this is being paralleled by the more recent work in large NT-based clusters of PCs as well (e.g. at NCSA, CTC, UCSD). Of course, now the field of Beowulf computing has matured such that it is partnered with industry, large and small, in hardware (e.g. Compaq, IBM, VA Linux, HPTI, Microway) and software (e.g. TurboLinux, Red Hat, SuSE, Scyld) providing improved functionality, performance, and robustness at reasonable (usually) cost. As a result, many tasks in academia, industry, government, and commerce are now performed on this class of systems, providing a stable architecture family for both ISV and applications programmers to target with confidence while riding the Moore wave through future generations of advanced technology. Indeed, many of our computer science students have their first experiences with parallel computing on small Beowulfs.
How far clusters in general and Beowulf-class systems in particular can go is a tantalizing question. The challenges today may be seen in three dimensions: 1) bandwidth and latency of communications, 2) usability and generality of system environments, and 3) availability and robustness for industrial grade operation. The first is now being addressed by industry perhaps starting with the pathfinding work of Chuck Seitz with Myrinet. Improvements in both latency and bandwidth by one and two orders of magnitude over the baseline Fast-Ethernet LAN are being achieved with such consortium drivers as VIA and Infiniband. Bandwidths beyond 10 Gbps and real latencies approaching a microsecond are on the horizon as zero-copy software and optical channels become mainstream for future system area networks.
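The interplay of latency and bandwidth Sterling describes is often summarized by the simple linear (alpha-beta) cost model, in which the time to move a message is a fixed startup latency plus the message size divided by bandwidth. A minimal sketch of that model follows; the network figures are illustrative assumptions chosen to echo the Fast-Ethernet baseline versus a low-latency system area network, not measurements:

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta model: time = startup latency + size / bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# Illustrative parameters (assumptions, not measured values):
# a Fast-Ethernet-class baseline vs. a low-latency SAN such as Myrinet.
fast_ethernet = dict(latency_s=100e-6, bandwidth_bytes_per_s=12.5e6)   # ~100 us, 100 Mbps
low_latency_san = dict(latency_s=10e-6, bandwidth_bytes_per_s=250e6)   # ~10 us, 2 Gbps

for name, net in [("Fast Ethernet", fast_ethernet), ("low-latency SAN", low_latency_san)]:
    # Small messages are latency-dominated; large ones bandwidth-dominated.
    small = transfer_time(1_024, **net)
    large = transfer_time(1_048_576, **net)
    print(f"{name}: 1 KB -> {small * 1e6:.0f} us, 1 MB -> {large * 1e3:.2f} ms")
```

The model makes plain why small-message workloads gain most from latency reductions while bulk transfers gain most from bandwidth, which is why improvements on both axes matter for clusters.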
A number of groups in the US, Japan, and Europe are developing tools to establish acceptable environments for managing, administering, and applying these systems to real-world workloads. This will take some time to shake out, although significant progress is finally being made. Various efforts to collect representative tools into usable distributions (e.g. Oscar, Scyld, Grendel) and make them available are involving collaborations across many institutions. While such systems may never be easy to program or truly transparent or seamless in their supervision, they may prove sufficient within the bounds of practical necessity.
Finally, the issue of reliability is one that appears to vary dramatically. One hears horror stories of nodes dying every few hours and others where complete systems stay up for more than half a year. At Caltech our Beowulf “Naegling” has had a worst case node failure within 80 days and a best failure time of almost 200 days. This is after surviving the usual burn-in period. Infant mortality is always part of the experience and certain types of components (e.g. fans, power supplies, disks, NICs) tend to experience some fatality within the first few weeks. Then the systems stabilize. A similar process occurs with the software environments; bugs in the installation and configuration are exposed early on and have to be eliminated one by one, sometimes painfully. But industry investment in the mass market nodes and networks and their recent efforts in system integration are showing results in improving availability and robustness. More work is needed in limiting the down time of a system when an individual component dies. There are severe challenges in even detecting when wrong results are produced, although a system keeps running. These are expected to receive increasing attention as a real market, especially in commerce, is found for systems as large as thousands of processors.
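The reliability scaling problem behind these anecdotes can be made concrete with a back-of-the-envelope model: if node failures after burn-in are independent and exponentially distributed, the expected time to the first failure anywhere in an N-node system is the per-node MTBF divided by N. The numbers below are hypothetical, chosen only to illustrate how quickly large clusters erode the failure interval:

```python
def system_mtbf_days(node_mtbf_days, num_nodes):
    """Expected time to the first node failure in the whole system,
    assuming independent, exponentially distributed node failures:
    system MTBF = node MTBF / N."""
    return node_mtbf_days / num_nodes

# Hypothetical post-burn-in node MTBF of ~3 years (1095 days).
for n in (16, 128, 1024):
    print(f"{n:5d} nodes -> first failure expected in "
          f"{system_mtbf_days(1095, n):.1f} days")
```

Under these assumptions a thousand-node machine sees its first failure in roughly a day, which is why limiting per-failure downtime, rather than eliminating failures, becomes the practical engineering goal at scale.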
We are approaching the milestone (albeit somewhat arbitrary) of being able to assemble a Teraflops-scale Beowulf-class system for one million dollars. But the cost of running and maintaining such a system is non-trivial and has to be accounted for. And industry (e.g. Sun, Compaq, SGI, IBM) is playing an increasingly important role in making such systems accessible. Another area that is lagging is that of distributed mass storage and generalized parallel file servers. Systems oriented around the storage and fetching of mass data sets are likely to drive the commercial customer base for clusters and play an important role in scientific computing as well. While some early systems are being employed (e.g. PPFS, PVFS), much work has yet to be done in this area. With system on a chip (SOC) technology allowing multiple processors and their integrated caches to be implemented on a single die, and clock rates slowly increasing through the single-digit GHz regime, performance density is likely to continue to advance at a steady pace. Will we see a Petaflops Beowulf by 2010, as possibly implied by the Top500 list? It is not out of the question, although personally I hope we find a better way. Beowulf was always about picking the low hanging fruit and has consistently shown that where there is a way, there is a will.
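The one-million-dollar Teraflops milestone is easy to sanity-check with simple price/performance arithmetic: divide the target aggregate performance by per-node peak performance to get a node count, then multiply by node price. The node figures below are hypothetical, roughly in the spirit of a circa-2000 commodity PC, and serve only to illustrate the calculation:

```python
import math

def cluster_cost(target_flops, flops_per_node, cost_per_node):
    """Nodes needed (rounded up) and total hardware cost to reach a
    target aggregate peak performance. Ignores network, power, space,
    and administration, which the interview notes are non-trivial."""
    nodes = math.ceil(target_flops / flops_per_node)
    return nodes, nodes * cost_per_node

# Hypothetical node: ~1 GFLOPS peak at ~$1,000.
nodes, cost = cluster_cost(1e12, 1e9, 1000)
print(f"{nodes} nodes, ${cost:,} in node hardware")
```

Even when the node arithmetic lands on a million dollars, the interconnect and the running costs Sterling mentions sit on top of that figure, which is why the milestone is "somewhat arbitrary."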
HPCwire: Within the last year several firms have emerged that are solely focused upon exploiting computing power from large networks of Internet-connected PCs. How do you view these efforts? What will ultimately determine the success or failure of such ventures?
STERLING: This is a new frontier in distributed computing and one based on the perceived opportunity of an untried business model. What I call “cottage computing” is unique and has no analogue in other domains of economy or production (that I can think of) since the beginnings of the industrial revolution in the mid 18th century. The SETI@home experience is tantalizing, and it stimulates consideration of the broader application that is driving these new enterprises. But I am extremely uncertain of the outcome. It will ultimately be determined by the complex interplay of factors including the difficulties of achieving adequate security in both directions, the relative value of diffuse computing cycles, and the competing alternatives. While I am not yet convinced of a favorable outcome, this is an exciting process with some very sharp people heavily engaged. Its evolution will be very interesting to watch over the next 18 months.
HPCwire: As Principal Investigator for the interdisciplinary Hybrid Technology Multithreaded (HTMT) architecture research project, you have a unique insight into the characteristics of these fascinating technologies. Please share some of your thoughts and observations with us.
STERLING: The multi-institution, interdisciplinary HTMT architecture research project is a four-year effort to explore a synthesis of alternative technologies and architectural structures to enable practical general-purpose computing in the trans-Petaflops regime. The genesis of this advanced exploratory investigation was catalyzed by the initial findings of the National Petaflops Initiative, a community-wide process, and is aligned with the strong recommendations of the President’s Information Technology Advisory Committee (PITAC) on high performance computing research directions. Through HTMT, significant insights have been acquired revealing the potential of aggressively exploiting non-conventional strategies for achieving ultra-scale performance. Perhaps the most important was the value of inter-relating system structure and disparate technologies to accomplish a synergy of complementary technology characteristics. Much of the public attention and controversy has been on the technologies themselves, which pushed the capability of logic speed, storage capacity, and communications throughput to extremes.
While the project was not committed to any particular device, it studied specific example technologies in detail, in some cases contributing to their advancement. Among these, the innovative packet-switched Data Vortex optical network, exploiting both time division and wave division multiplexing, may have near term impact for a wide range of high-end systems. Optical holographic storage was shown to provide one possible means of providing a high density, high throughput memory layer between primary and secondary storage. The merger of semiconductor DRAM cells and CMOS logic was shown to enable Processor in Memory (PIM) smart memory structures that may make possible new relationships between high speed computer processors and their memory hierarchy, imbued with extended functionality. The most controversial aspect of the project was its investigation of superconductor rapid single flux quantum (RSFQ) logic. Conventional wisdom dictates that previous experiences by IBM and within Japan demonstrated that computers built from superconductor electronics were infeasible, and that the cooling requirements made them impractical.
The findings of the HTMT project are that Niobium-based RSFQ logic is both feasible and practical and affords unique opportunities in the design of very high-speed processors with clock rates between 50 GHz and 150 GHz. However, within the constraints of existing fabrication facilities and industrial/government investment, the likelihood of realizing such components is remote. Even more significant than the technologies within HTMT is the architecture that would incorporate them. HTMT explored the potential of a dynamic adaptive resource management methodology called “percolation” that employs smart memories to determine when tasks are to be performed and to pre-stage all information related to task execution proactively using low cost in-memory logic. The conclusion is that such small processors can remove the combined problems of overhead and latency from the main processors while performing many of the low-locality data intensive operations in the memories themselves. The result would be very efficient operation even on those algorithms that have proven difficult to optimize in the past. The overall conclusion of the HTMT project is that increased investment in high performance computer system research stands to realize significant potential benefits as yet unexploited.
HPCwire: What are the most important issues facing HPC today? What are the best ways those within the community can pursue creative solutions?
STERLING: The dominant strategic issues today are: first, is HPC important, and second, must all future HPC systems be limited to COTS clusters and their equivalents. While the first issue may appear silly to some, there is a real threat to HPC and supercomputing as a goal and discipline, with some respected colleagues publicly stating that performance as a research goal is no longer important. This is in part driven by the excitement about the potential of the Internet, Web, and Grids, which are perceived as a more attractive and lucrative area of pursuit than HPC systems development. There is also an apparent malaise derived from the perception of a small and shrinking HPC market, the Moore’s law juggernaut, lack of funding, the diminishing glamour, and the poor track record of such research in the past. For this reason, where HPC is really needed, both industry and academia in many cases perceive clusters, including but not limited to Beowulf-class systems, to be an easy, relatively low-cost way out, whose short-term difficulties will, it is presumed, be rectified to an adequate degree by future developments in distributed system software.
Our work with Beowulfs has shown us that in many cases this is an acceptable solution and that the contributions being made by Becker and many others will reduce, although probably not close, the gap between cheap hardware and needed user environments. But my work on HTMT and Petaflops scale computing has revealed both the need for and the opportunity of devising innovative new structures for attacking major computational challenges at performance levels orders of magnitude beyond those being implemented today. The early work by IBM on its BlueGene project is suggesting the same conclusions. Solving the problems of controlled fusion, molecular protein folding and drug design, high confidence climate modeling, complex aerospace vehicle design optimization, and brain modeling, while perhaps not as enticing as real-time video games and e-commerce, would nonetheless revolutionize human existence in the 21st Century.
From a technical perspective, the dual challenges of good price-performance for scalable systems and latency management for acceptable system efficiency are matched by the more vague goals of programmability and generality. These are nothing new in the field of parallel processing but their impact is of increasing significance as system scale extends beyond 10,000 coarse-grain nodes (e.g. ASCI White) or even a million fine grain nodes (e.g. IBM BlueGene) and as more complex interdisciplinary applications are pursued. Cost is important and the need to devise structures that can be realized at low cost, other than COTS cluster techniques, is critical. PIM is one very real possibility here but the architectures, while retaining simplicity, must be advanced well beyond current examples.
In my view (and others may disagree), dynamic adaptive resource methodologies, most likely exploiting PIM smart memories, may address the key problems of latency (perhaps through percolation), overhead, and load balancing while simplifying both hardware and software development. But in the long term, even as it pains me to say so, I see a need for advanced parallel languages that are not constrained by assumptions of conventional underlying hardware components and organizations. I believe a new decision-tree model for resource management is required; one that revises the notion of what a computer knows and when it knows it in making the determination of resource to task allocation in time and space. These questions are both significant and tantalizing. It remains only for the combined high performance research community to revitalize its commitment to their pursuit and ultimate resolution.
HPCwire: How would you characterize the current interrelationship between national policy, corporate policy, and leading-edge HPC research? Should this be modified? If so, how?
STERLING: It is difficult to characterize “national policy” as it pertains to HPC research. The PITAC recommendations on future directions in HPC research were clear and specific, and I adhere to them both in principle and in their explicit proposed actions. These recommendations were not addressed by the Federal agencies for FY01, although many other important areas in IT considered by PITAC did receive attention. There is real interest in many quarters to do so, but at the moment, aggressive pursuit of these ideals remains dormant. Corporate policy quite reasonably focuses on the sweet spot of the market, and the cluster approach lends itself well to this strategy, providing a degree of scalability without investing in unique systems for the high end. The risks are too high for industry alone to attack them while the perception is that the market is too small to provide adequate financial return. My guess is that the latent market is much greater, but not at the price-performance point of the older supercomputers and MPPs. Of course, many applications today routinely run on the desktop at performance levels that supercomputer applications consumed a decade ago. That should be a strong signal that the opportunities for much greater performance systems are plentiful. However, the community either has not gotten the hint or rather uses the same experience to justify waiting: Petaflops will come to those who wait; Moore or less.
From my previous comments, yes, I believe the apparent policies and interrelationships should be modified. The partnership between national policy and corporate policy in HPC research should be one of mutual and complementing strengths. The DOE ASCI program performed well in working with industry to develop pace-setting high performance systems through the extension of conventional means, and in so doing demonstrated the value of advanced capability systems for exploring the frontiers of science and technology through computation. But no counterbalancing non-incremental advanced research of significance has been undertaken or sponsored to explore over-the-horizon regimes. Yes, quantum computing and other exotic forms of processing are being supported under basic research. But there is a major gap between these and today’s conventional distributed systems. I would like to see the PITAC recommendations in HPC carried out, and a partnering between industry and government developed, involving the academic community, to explore innovative opportunities and reduce the risk so that a truly new class of parallel computer system can emerge to escape the current cul-de-sac in which we are trapped and deliver a revolutionary new tool with which to build the world habitat of the 21st century.